TensorFlow TextVectorization adapt(): checking the produced vocabulary

The TextVectorization layer is used to encode text, and the typical workflow calls its adapt() method:

Next, you will call adapt to fit the state of the preprocessing layer to the dataset. This will cause the model to build an index of strings to integers.

(https://www.tensorflow.org/tutorials/keras/text_classification)

If desired, the user can call this layer's adapt() method on a dataset. When this layer is adapted, it will analyze the dataset, determine the frequency of individual string values, and create a 'vocabulary' from them.

(https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization#adapt)

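What exactly is the result of the adapt() operation, and how can I inspect the contents of the vocabulary it creates?

A short snippet of my code: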

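from tensorflow.keras.layers import TextVectorization

seq_length = 100
vocab_size = 50000

vectorize_layer = TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=seq_length)

# Build the vocabulary (text_ds is the text dataset, defined elsewhere)
vectorize_layer.adapt(text_ds)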

layer.get_vocabulary() does this:

>>> data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
>>> layer = tf.keras.layers.StringLookup()
>>> layer.adapt(data)
>>> layer.get_vocabulary()

['[UNK]', 'd', 'z', 'c', 'b', 'a']

https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup
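
The same method is available on TextVectorization, so the layer from the question can be inspected the same way. Below is a minimal sketch, assuming a small made-up corpus in place of text_ds; the sentences, the batch size, and the shortened output_sequence_length are illustrative and not taken from the question.

```python
import tensorflow as tf

# Hypothetical toy corpus standing in for the question's text_ds.
text_ds = tf.data.Dataset.from_tensor_slices(
    ["the cat sat", "the dog sat", "the bird flew"]
).batch(2)

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=50000,
    output_mode="int",
    output_sequence_length=5,  # shortened from 100 to keep the output readable
)

# Build the vocabulary from the dataset.
vectorize_layer.adapt(text_ds)

# The vocabulary is a plain Python list; a token's position is its integer id.
vocab = vectorize_layer.get_vocabulary()
print(vocab)

# Vectorize one sentence, then map the ids back to tokens through the list.
ids = vectorize_layer(tf.constant(["the cat sat"])).numpy()[0]
decoded = [vocab[i] for i in ids]
print(ids)
print(decoded)
```

In 'int' output mode, index 0 is reserved for padding (the empty string) and index 1 for out-of-vocabulary tokens ('[UNK]'); the remaining entries run from most to least frequent. Sequences shorter than output_sequence_length are padded with 0, which is why the decoded list ends with empty strings.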