How does Tokenizer in tensorflow deal with out of vocabulary tokens if I don't provide oov_token?
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
encoded_docs = tokenizer.texts_to_sequences(X_train)
padded_sequence = pad_sequences(encoded_docs, maxlen=60)
test_tweets = tokenizer.texts_to_sequences(X_test)
test_padded_sequence = pad_sequences(test_tweets, maxlen=60)
I don't get any errors from this code, even though I didn't provide the oov_token argument. I expected an error at test_tweets = tokenizer.texts_to_sequences(X_test), since X_test contains words the tokenizer never saw during fitting. How does tensorflow handle out-of-vocabulary words at test time when you don't provide oov_token?
If oov_token is None (the default), out-of-vocabulary words are silently ignored/dropped from the output sequences:
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(['hello world'])
print(tokenizer.word_index)

# 'friends' was never seen during fit_on_texts, so it is dropped entirely:
sequences = tokenizer.texts_to_sequences(['hello friends'])
print(sequences)
{'hello': 1, 'world': 2}
[[1]]
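By contrast, if you do pass an oov_token, the tokenizer reserves index 1 for it and maps every unseen word to that index instead of dropping it. A minimal sketch (the '&lt;OOV&gt;' string is just a conventional choice, not required by the API):

```python
import tensorflow as tf

# Reserve a dedicated token for out-of-vocabulary words
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(['hello world'])
print(tokenizer.word_index)  # {'<OOV>': 1, 'hello': 2, 'world': 3}

# 'friends' is unseen, so it maps to the OOV index (1) rather than vanishing
sequences = tokenizer.texts_to_sequences(['hello friends'])
print(sequences)  # [[2, 1]]
```

This keeps the sequence length aligned with the input text, which matters if your model should at least register that *some* word occurred at that position.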