How does text encoding from tensorflow.keras.preprocessing.text.Tokenizer differ from the old tfds.deprecated.text.TokenTextEncoder?
With the deprecated encoding approach, tfds.deprecated.text.TokenTextEncoder, we first build a vocabulary set of tokens:
tokenizer = tfds.deprecated.text.Tokenizer()
vocabulary_set = set()
# imdb_train --> IMDB dataset from tensorflow_datasets
for example, label in imdb_train:
    some_tokens = tokenizer.tokenize(example.numpy())
    vocabulary_set.update(some_tokens)  # collect the tokens into the vocabulary
Then the encoder is constructed:
encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set,
                                                lowercase=True,
                                                tokenizer=tokenizer)
Later, when encoding, I noticed that the encoder outputs a single integer per word; for example, while debugging I found that the word "the" was encoded as 112:
token_id = encoder.encode(word)[0]
>> token_id = 112
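For context, here is a minimal self-contained sketch of that one-integer-per-token behavior, with a toy vocabulary standing in for the IMDB one (the toy words are an assumption; the exact ids follow the iteration order of the vocabulary set):

import tensorflow_datasets as tfds

tokenizer = tfds.deprecated.text.Tokenizer()
vocabulary_set = {'you', 'are', 'a', 'fish'}
encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set,
                                                lowercase=True,
                                                tokenizer=tokenizer)

ids = encoder.encode('You are a fish')  # one integer per token
print(ids)                              # e.g. [2, 1, 4, 3] -- depends on set order
print(encoder.decode(ids))              # 'you are a fish'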
But with tensorflow.keras.preprocessing.text.Tokenizer:
tokenizer = tensorflow.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(words)
token_id = tokenizer.texts_to_sequences(word) #word = the
>> token_id = [800,2085,936]
It produces a sequence of 3 integers. Should I use all 3 numbers, or would it also be correct to take just 1 number from the sequence? I am trying to use these encoded integers to build an embedding matrix with GloVe embeddings. The old deprecated encoder produced a single integer, which made the mapping easy; with a sequence of integers I am not sure how to proceed.
Maybe try something like this:
import tensorflow as tf

lines = ['You are a fish', 'This is a fish', 'Where are the fishes']

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(lines)

# encode each line as a sequence of word ids, then pad to equal length
text_sequences = tokenizer.texts_to_sequences(lines)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences, padding='post')

# +1 because index 0 is reserved for padding and never assigned to a word
vocab_size = len(tokenizer.word_index) + 1

print(tokenizer.word_index)
print(vocab_size)
print(tokenizer.texts_to_sequences(['fish'])[0])
{'are': 1, 'a': 2, 'fish': 3, 'you': 4, 'this': 5, 'is': 6, 'where': 7, 'the': 8, 'fishes': 9}
10
[3]
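The three integers in the question are most likely a side effect of passing a bare string to texts_to_sequences: Keras iterates over its argument, so the string 'the' is treated as the three one-character texts 't', 'h' and 'e', each encoded separately. A quick sketch with the toy tokenizer above (here the single letters are not in the vocabulary, so their sequences come back empty):

print(tokenizer.texts_to_sequences('fish'))    # iterated as 'f', 'i', 's', 'h'
# [[], [], [], []]
print(tokenizer.texts_to_sequences(['fish']))  # one text in, one sequence out
# [[3]]

So wrap the word in a list and take the single id, as in the 'fish' example above.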
Index 0 is reserved for the padding token. Then build the weight matrix from the GloVe model; try this:
import gensim.downloader as api
import numpy as np

model = api.load("glove-twitter-25")  # pretrained 25-dimensional GloVe vectors
embedding_dim = 25

# one row per vocabulary index; row 0 stays all-zero for the padding token
weight_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    try:
        embedding_vector = model[word]
        weight_matrix[i] = embedding_vector
    except KeyError:
        # word not covered by GloVe: fall back to a random vector
        weight_matrix[i] = np.random.uniform(-5, 5, embedding_dim)

print(weight_matrix.shape)
# (10, 25)
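To actually use the matrix, it can initialize a Keras Embedding layer; a minimal sketch (freezing the weights with trainable=False is an assumption, not something required by the question):

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[weight_matrix],  # start from the GloVe-based matrix
    mask_zero=True,           # index 0 is the padding token
    trainable=False)          # keep the pretrained vectors fixed (optional)

embedded = embedding_layer(text_sequences)
print(embedded.shape)  # (3, 4, 25) for the toy lines above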