How to use custom function with tensorflow dataset API?

I am new to TensorFlow's tf.data.Dataset, and I am trying to use it on data that I loaded with a pandas DataFrame, as shown below.

Loading the input data (df_input):

    id               messages  Label
0   11  I am not driving home   0
1   11      Please pick me up   1
2  103   The car already park   1
3  103     No need for ticket   0
4  104       I will buy a car   1
5  104       I will buy truck   1

I preprocess the data and apply text vectorization as follows:

import tensorflow as tf
from tensorflow.keras import layers

text_vectorizer = layers.TextVectorization(max_tokens=20, output_mode="int", output_sequence_length=6)
text_vectorizer.adapt(df_input.messages.values.tolist())

def encode(texts):
    encoded_texts = text_vectorizer(texts)
    return encoded_texts.numpy()

train_data = encode(df_input.messages.values)  ## The training data
train_label = tf.keras.utils.to_categorical(df_input.Label.values, 2)  ## The labels

Then I use tf.data.Dataset so the preprocessed data can be used to train the model, as follows:

train_dataset_df = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .shuffle(1000)
    .batch(2)
    )

My question is: how can I apply my custom function to the training data at each training epoch? From another post, I saw an example of applying a transformation through the .map function:
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))

My goal is to apply my own custom function (which reorders the words in the text data) as follows:

def order_augment_sent(Sentence):
    words = Sentence.split(" ")
    words.sort()
    newSentence = " ".join(words)
    return newSentence


train_dataset_ds = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .shuffle(1000)
    .batch(2)
    .map(lambda x, y: (order_augment_sent(x), y))
    )

But the error I get is:

AttributeError: 'Tensor' object has no attribute 'split'

Or, if I apply my other custom function, I get:

TypeError: To be compatible with tf.function, Python functions must return zero or more Tensors or ExtensionTypes or None values; in compilation of <function _tf_if_stmt.<locals>.aug_body at 0124f565>, found return value of type WarningException, which is not a Tensor or ExtensionType.

I am not sure how to do this; if you have any ideas or solutions that could help me, I would be grateful.

The arguments you get in the lambda function are the tokens produced by the vectorizer, so they are integers. If you want to reorder the text data, you need to do it before text_vectorizer is applied.
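You can see this by inspecting the element spec of the dataset built in the question (a quick check; the shapes assume the settings shown above, i.e. output_sequence_length=6 and two label classes):

print(train_dataset_df.element_spec)
# (TensorSpec(shape=(None, 6), dtype=tf.int64, name=None),
#  TensorSpec(shape=(None, 2), dtype=tf.float32, name=None))

Inside .map, x is that integer tensor, which is why calling the Python string method .split on it raises the AttributeError.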

So you should add the TextVectorization layer to your model instead; that way your map function will receive strings, and you can reorder the sentences before TextVectorization is called.

Here is an almost-working example; you only need to edit the order_augment_sent function with the code you need. I don't know what kind of ordering you want to do, so you may have to write a custom sort using numpy via tf.py_function / tf.numpy_function: https://www.tensorflow.org/api_docs/python/tf/py_function
import tensorflow as tf
import numpy as np

train_data = ["I am not driving home", "Please pick me up", "The car already park", "No need for ticket", "I will buy a car", "I will buy truck"]
train_label = [0,1,1,0,1,1]

text_dataset = tf.data.Dataset.from_tensor_slices(train_data)
max_features = 5000  # Maximum vocab size.
max_len = 4  # Sequence length to pad the outputs to.

# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
vectorize_layer.adapt(train_data)

# Create the model that uses the vectorize text layer
model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
model.add(vectorize_layer)

def apply_order_augment_sent(s):
    # tf.numpy_function passes string tensors in as numpy bytes objects,
    # so decode before using Python string methods.
    Sentence = s.decode('utf-8')
    words = Sentence.split(" ")
    words.sort()
    newSentence = " ".join(words)
    return newSentence

def order_augment_sent(x: np.ndarray, y: np.ndarray):
    # x is a batch of sentences; reorder the words of each one.
    new_x = []
    for i in range(len(x)):
        new_x.append(np.array([apply_order_augment_sent(x[i])]))

    print('new', new_x, y)  # debug output to inspect the augmented batch
    return new_x, y



train_dataset_ds = tf.data.Dataset.from_tensor_slices((train_data, train_label))
train_dataset_ds = train_dataset_ds.shuffle(1000).batch(32)
# tf.numpy_function runs the plain Python/numpy augmentation inside the
# pipeline; the output dtypes have to be declared explicitly.
train_dataset_ds = train_dataset_ds.map(lambda item1, item2: tf.numpy_function(
    order_augment_sent, [item1, item2], [tf.string, tf.int32]))

list(train_dataset_ds.as_numpy_iterator())


model.predict(train_dataset_ds)
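
An alternative, as a minimal sketch rather than a drop-in replacement (it reuses the train_data, train_label, and vectorize_layer defined above), is to run the augmentation per element with tf.py_function before batching, restore the static shape that py_function drops, and keep the vectorization in the input pipeline instead of in the model:

def order_augment_sent_eager(sentence):
    # Inside tf.py_function the argument is an EagerTensor, so .numpy() works.
    words = sentence.numpy().decode("utf-8").split(" ")
    words.sort()
    return " ".join(words)

def augment(x, y):
    new_x = tf.py_function(order_augment_sent_eager, [x], tf.string)
    new_x.set_shape(())  # py_function loses shape information; restore the scalar shape
    return new_x, y

train_dataset_ds = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .map(augment)  # re-runs on every pass over the data, before batching
    .shuffle(1000)
    .batch(2)
    .map(lambda x, y: (vectorize_layer(x), y))
    )

Because the augmentation lives inside .map, it is re-executed every epoch; with a deterministic sort like this the result is identical each time, but a random reordering would give a different view of the data per epoch.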