Cannot make sense of keras.datasets.imdb

Question

I have two questions:

First, the documentation for tf.keras.datasets.imdb.get_word_index says

Retrieves the dictionary mapping word indices back to words.

although in fact it is exactly the opposite: the dictionary maps words to indices,

print(tf.keras.datasets.imdb.get_word_index())

{'fawn': 34701, 'tsukino': 52006, 'nunnery': 52007

Second, I tried running this in TensorFlow 2.0:

import tensorflow as tf

(train_data_raw, train_labels), (test_data_raw, test_labels) = tf.keras.datasets.imdb.load_data()
words2idx = tf.keras.datasets.imdb.get_word_index()
idx2words = {idx:word for word, idx in words2idx.items()}
i = 0
train_ex = [idx2words[x] for x in train_data_raw[0]]
train_ex = ' '.join(train_ex)
print(train_ex)

This produces a meaningless string:

the as you with out themselves powerful lets loves their [...]

Shouldn't I get a valid movie review?

Answer 1

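I did some digging and found that a couple of "offsets" applied during preprocessing have to be undone before the sequence decodes to sensible review text. I modified your line to subtract 3 from every index in the raw sequence, because load_data defaults to index_from=3, which shifts every word index up by 3 to leave room for reserved tokens. In addition, the first element is a dummy start marker (set to 1), so the real text starts at position 2 (index 1 in Python).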

train_ex = [idx2words[x-3] for x in train_data_raw[0][1:]]

With that modification, I get the following for the review you originally picked:

this film was just brilliant casting location scenery story direction everyone's really suited the part they played ...

Some punctuation, capitalization, and so on appear to have been stripped, but this does return a sensible review.
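The same idea can be wrapped in a small helper that also handles the reserved ids instead of slicing them off. This is only a minimal sketch: `decode_review` and the placeholder strings `<PAD>`, `<START>`, and `<UNK>` are names I made up for illustration, and the toy dictionary stands in for the real result of get_word_index() so the snippet runs without downloading anything. It assumes the load_data defaults (start_char=1, oov_char=2, index_from=3).

```python
from typing import Dict, Iterable


def decode_review(
    encoded: Iterable[int],
    word_index: Dict[str, int],
    index_from: int = 3,
) -> str:
    """Turn a sequence of shifted integer ids back into space-separated words."""
    # Shift every dictionary index up by index_from so it matches the encoded data.
    id_to_word = {idx + index_from: word for word, idx in word_index.items()}
    # Low ids are reserved; these placeholder strings are hypothetical labels.
    id_to_word[0] = "<PAD>"
    id_to_word[1] = "<START>"
    id_to_word[2] = "<UNK>"
    # Any id missing from the mapping is rendered as unknown rather than raising.
    return " ".join(id_to_word.get(i, "<UNK>") for i in encoded)


# Toy stand-in for get_word_index(); the real dictionary is far larger.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
# Toy stand-in for one encoded review: start marker 1, then word ids shifted by 3.
toy_review = [1, 4, 5, 6, 7]

decoded = decode_review(toy_review, toy_word_index)
print(decoded)  # <START> the film was brilliant
```

With the real data, the call would presumably be decode_review(train_data_raw[0], tf.keras.datasets.imdb.get_word_index()). Using .get with a fallback avoids a KeyError if an out-of-vocabulary id (2) shows up, which can happen once num_words is set in load_data.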

Hope this helps.
