How does text encoding from tensorflow.keras.preprocessing.text.Tokenizer differ from the old tfds.deprecated.text.TokenTextEncoder?
With the deprecated encoding approach, tfds.deprecated.text.TokenTextEncoder, we first build a vocabulary set of tokens:
tokenizer = tfds.deprecated.text.Tokenizer()
vocabulary_set = set()
# imdb_train --> IMDB dataset from tensorflow_datasets
for example, label in imdb_train:
    some_tokens = tokenizer.tokenize(example.numpy())
    vocabulary_set.update(some_tokens)  # collect the tokens into the vocabulary
Then the encoder is constructed:
encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set,
                                                lowercase=True,
                                                tokenizer=tokenizer)
Later, when encoding, I noticed that the encoder outputs a single integer per word; for example, while debugging I found that the word "the" was encoded as 112:
token_id = encoder.encode(word)[0]
>> token_id = 112
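For context, here is a minimal self-contained sketch of that one-integer-per-token behavior, with a toy vocabulary standing in for the IMDB one (the toy words are an assumption; the exact ids follow the iteration order of the vocabulary set):

import tensorflow_datasets as tfds

tokenizer = tfds.deprecated.text.Tokenizer()
vocabulary_set = {'you', 'are', 'a', 'fish'}
encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set,
                                                lowercase=True,
                                                tokenizer=tokenizer)

ids = encoder.encode('You are a fish')  # one integer per token
print(ids)                              # e.g. [2, 1, 4, 3] -- depends on set order
print(encoder.decode(ids))              # 'you are a fish'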
But with tensorflow.keras.preprocessing.text.Tokenizer:
tokenizer = tensorflow.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(words)
token_id = tokenizer.texts_to_sequences(word) #word = the
>> token_id = [800,2085,936]
It produces a sequence of 3 integers. Should I use all 3 numbers, or would it also be correct to take just 1 number from the sequence? I am trying to use these encoded integers to build an embedding matrix with GloVe embeddings. The old deprecated encoder produced a single integer, which made the mapping easy; with a sequence of integers I am not sure how to proceed.
Maybe try something like this:
import tensorflow as tf

lines = ['You are a fish', 'This is a fish', 'Where are the fishes']

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(lines)

# encode each line as a sequence of word ids, then pad to equal length
text_sequences = tokenizer.texts_to_sequences(lines)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences, padding='post')

# +1 because index 0 is reserved for padding and never assigned to a word
vocab_size = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(vocab_size)
print(tokenizer.texts_to_sequences(['fish'])[0])
{'are': 1, 'a': 2, 'fish': 3, 'you': 4, 'this': 5, 'is': 6, 'where': 7, 'the': 8, 'fishes': 9}
10
[3]
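The three integers in the question are most likely a side effect of passing a bare string to texts_to_sequences: Keras iterates over its argument, so the string 'the' is treated as the three one-character texts 't', 'h' and 'e', each encoded separately. A quick sketch with the toy tokenizer above (here the single letters are not in the vocabulary, so their sequences come back empty):

print(tokenizer.texts_to_sequences('fish'))    # iterated as 'f', 'i', 's', 'h'
# [[], [], [], []]
print(tokenizer.texts_to_sequences(['fish']))  # one text in, one sequence out
# [[3]]

So wrap the word in a list and take the single id, as in the 'fish' example above.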
Index 0 is reserved for the padding token. Then build the weight matrix from the GloVe model; try this:
import gensim.downloader as api
import numpy as np

model = api.load("glove-twitter-25")  # pretrained 25-dimensional GloVe vectors
embedding_dim = 25

# one row per vocabulary index; row 0 stays all-zero for the padding token
weight_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    try:
        embedding_vector = model[word]
        weight_matrix[i] = embedding_vector
    except KeyError:
        # word not covered by GloVe: fall back to a random vector
        weight_matrix[i] = np.random.uniform(-5, 5, embedding_dim)

print(weight_matrix.shape)
# (10, 25)
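To actually use the matrix, it can initialize a Keras Embedding layer; a minimal sketch (freezing the weights with trainable=False is an assumption, not something required by the question):

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[weight_matrix],  # start from the GloVe-based matrix
    mask_zero=True,           # index 0 is the padding token
    trainable=False)          # keep the pretrained vectors fixed (optional)

embedded = embedding_layer(text_sequences)
print(embedded.shape)  # (3, 4, 25) for the toy lines above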