How to apply tf.keras.preprocessing.text.Tokenizer on tf.data.TextLineDataset?
I am loading a TextLineDataset and I want to apply a tokenizer trained on the file:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

data = tf.data.TextLineDataset(filename)
MAX_WORDS = 20000
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts([x.numpy().decode('utf-8') for x in data])
Now I want to apply this tokenizer on data so that every word is replaced by its encoded value. I tried data.map(lambda x: tokenizer.texts_to_sequences(x)), which gives: OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.
Following that suggestion, when I write the code as:
@tf.function
def fun(x):
    return tokenizer.texts_to_sequences(x)

data.map(lambda x: fun(x))
I get: OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
So how can I tokenize data?
The problem is that tf.keras.preprocessing.text.Tokenizer does not work in graph mode. Checking the docs, both fit_on_texts and texts_to_sequences expect lists of strings rather than tensors. I would recommend using tf.keras.layers.TextVectorization, but if you really want the Tokenizer approach, try something like this:
import tensorflow as tf

with open('data.txt', 'w') as f:
    f.write('this is a very important sentence\n')
    f.write('where is my cat actually?\n')
    f.write('fish are everywhere!\n')

dataset = tf.data.TextLineDataset(['data.txt'])
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([n.numpy().decode("utf-8") for n in dataset])
def tokenize(x):
    return tokenizer.texts_to_sequences([x.numpy().decode("utf-8")])

dataset = dataset.map(lambda x: tf.py_function(tokenize, [x], Tout=[tf.int32])[0])
for d in dataset:
    print(d)
tf.Tensor([2 1 3 4 5 6], shape=(6,), dtype=int32)
tf.Tensor([ 7 1 8 9 10], shape=(5,), dtype=int32)
tf.Tensor([11 12 13], shape=(3,), dtype=int32)
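Note that each line yields a sequence of a different length, so before feeding the dataset to a model you will likely want padded_batch rather than batch. A minimal sketch, using hypothetical sequences standing in for the tokenized output above:

```python
import tensorflow as tf

# Hypothetical stand-ins for the variable-length tokenized sequences above.
sequences = [[2, 1, 3, 4, 5, 6], [7, 1, 8, 9, 10], [11, 12, 13]]

dataset = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32))

# padded_batch zero-pads every sequence in a batch to the longest one,
# producing a dense (batch, max_len) tensor.
batched = dataset.padded_batch(3)

for batch in batched:
    print(batch.shape)  # (3, 6)
```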
Using the TextVectorization layer, it looks like this:
with open('data.txt', 'w') as f:
    f.write('this is a very important sentence\n')
    f.write('where is my cat actually?\n')
    f.write('fish are everywhere!\n')

dataset = tf.data.TextLineDataset(['data.txt'])
vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int')
vectorize_layer.adapt(dataset)
dataset = dataset.map(vectorize_layer)
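To confirm what the layer learned, get_vocabulary() exposes the index-to-token mapping. A sketch on the same toy file (exact indices beyond the reserved slots depend on token frequencies in your data):

```python
import tensorflow as tf

with open('data.txt', 'w') as f:
    f.write('this is a very important sentence\n')
    f.write('where is my cat actually?\n')
    f.write('fish are everywhere!\n')

dataset = tf.data.TextLineDataset(['data.txt'])
vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int')
vectorize_layer.adapt(dataset)

# Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens;
# 'is' is the first real word because it appears twice in the corpus.
vocab = vectorize_layer.get_vocabulary()
print(vocab[:3])  # ['', '[UNK]', 'is']

# Each line maps to one integer sequence.
for seq in dataset.map(vectorize_layer):
    print(seq)
```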