How to tokenize a text using TensorFlow?

I am trying to vectorize a sentence using the following code:

from tensorflow.keras.layers import TextVectorization

text_vectorization_layer =  TextVectorization(max_tokens=10000,
                                              ngrams=5,
                                              standardize='lower_and_strip_punctuation',
                                              output_mode='int',
                                              output_sequence_length = 15
                                              )

text_vectorization_layer(['BlackBerry Limited is a Canadian software'])

However, it raises the following error:

AttributeError: 'NoneType' object has no attribute 'ndims'

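You have to compute the vocabulary of the TextVectorization layer first, either by calling the adapt method or by passing a vocabulary array to the layer's vocabulary argument. Until the layer has a vocabulary, it cannot map tokens to integer ids, which is why the call fails. Here is a working example: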

import tensorflow as tf

text_vectorization_layer =  tf.keras.layers.TextVectorization(max_tokens=10000,
                                              ngrams=5,
                                              standardize='lower_and_strip_punctuation',
                                              output_mode='int',
                                              output_sequence_length = 15
                                              )

text_vectorization_layer.adapt(['BlackBerry Limited is a Canadian software'])
print(text_vectorization_layer(['BlackBerry Limited is a Canadian software']))
tf.Tensor([[18  7 11 21 13  2 17  6 10 20 12 16  5  9 19]], shape=(1, 15), dtype=int64)
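
If you already know the vocabulary, the second option (the vocabulary argument) skips adapt entirely. Below is a minimal sketch; the word list and the sequence length of 8 are made up for illustration, and ngrams is left out so the ids are easy to follow. In 'int' mode the layer reserves index 0 for padding and index 1 for out-of-vocabulary tokens, so the supplied words start at 2.

```python
import tensorflow as tf

# Hypothetical vocabulary, chosen only for this illustration.
vocab = ["blackberry", "limited", "is", "a", "canadian", "software"]

layer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    output_mode="int",
    output_sequence_length=8,  # assumed length; shorter inputs are padded with 0
    vocabulary=vocab,          # no adapt() call needed
)

ids = layer(["BlackBerry Limited is a Canadian software"]).numpy()
print(ids)  # [[2 3 4 5 6 7 0 0]]

# Index 0 is the padding token '' and index 1 is '[UNK]'.
full_vocab = layer.get_vocabulary()
print(full_vocab[:3])
```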

The strings are tokenized internally. Also, check the docs.
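
To see what the internal tokenization produced, you can inspect the learned vocabulary with get_vocabulary(). The sketch below uses ngrams=2 instead of 5 (my choice, just to keep the vocabulary small): the six words give six unigrams plus five bigrams, and the layer adds the padding and out-of-vocabulary entries on top. This is also why a six-word sentence can fill far more than six output positions when ngrams is set.

```python
import tensorflow as tf

sentence = "BlackBerry Limited is a Canadian software"

layer = tf.keras.layers.TextVectorization(
    ngrams=2,  # assumption for this sketch: unigrams and bigrams only
    standardize="lower_and_strip_punctuation",
    output_mode="int",
)
layer.adapt([sentence])

learned = layer.get_vocabulary()
print(len(learned))  # 13 = 6 unigrams + 5 bigrams + '' + '[UNK]'
print(learned)

# One id per unigram and bigram found in the sentence.
ids = layer([sentence]).numpy()
print(ids.shape)  # (1, 11)
```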