How to visualize Gensim Word2vec Embeddings in Tensorboard Projector
Following the gensim word2vec embedding tutorial, I trained a simple word2vec model:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)
model.save("/content/word2vec.model")
I want to visualize it using the Embedding Projector in TensorBoard. There is another straightforward tutorial in the gensim documentation. I did the following in Colab:
!python3 -m gensim.scripts.word2vec2tensor -i /content/word2vec.model -o /content/my_model
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 94, in <module>
word2vec2tensor(args.input, args.output, args.binary)
File "/usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py", line 68, in word2vec2tensor
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_model_path, binary=binary)
File "/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py", line 1438, in load_word2vec_format
limit=limit, datatype=datatype)
File "/usr/local/lib/python3.7/dist-packages/gensim/models/utils_any2vec.py", line 172, in _load_word2vec_format
header = utils.to_unicode(fin.readline(), encoding=encoding)
File "/usr/local/lib/python3.7/dist-packages/gensim/utils.py", line 355, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
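Note that I did check this first, but the accepted answer there no longer works because both gensim and tensorflow have been updated, so I think it is worth asking again in Q4 2021.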
Saving the model in the original C word2vec implementation format resolves the issue:
model.wv.save_word2vec_format("/content/word2vec.model")
The full example:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
model = Word2Vec(sentences=common_texts, size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("/content/word2vec.model")
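With the vectors saved this way, the conversion command from the question should be able to read the file. To make clear what that conversion amounts to, here is a minimal stdlib-only sketch that splits a text-format word2vec file into a vectors TSV and a metadata TSV, which is the pair of files the Embedding Projector loads. The function name `word2vec_text_to_tsv`, the toy file and its values are my own illustrations, and the `_tensor.tsv` / `_metadata.tsv` suffixes are what I recall the gensim script using, so treat them as assumptions. This is not the script's actual code.

```python
import os
import tempfile


def word2vec_text_to_tsv(w2v_path, out_prefix):
    """Split a text-format word2vec file into two TSV files.

    Hypothetical helper for illustration only; it is not part of gensim.
    """
    tensor_path = out_prefix + "_tensor.tsv"      # one vector per line, tab-separated
    metadata_path = out_prefix + "_metadata.tsv"  # one word per line, same order
    with open(w2v_path, encoding="utf-8") as fin, \
            open(tensor_path, "w", encoding="utf-8") as f_tensor, \
            open(metadata_path, "w", encoding="utf-8") as f_meta:
        # The first line is the header: "<vocab_size> <vector_size>".
        vocab_size, dim = map(int, fin.readline().split())
        for line in fin:
            parts = line.rstrip("\n").split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:
                raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
            f_meta.write(word + "\n")
            f_tensor.write("\t".join(values) + "\n")
    return tensor_path, metadata_path


# Toy input in the original word2vec text format (made-up values).
workdir = tempfile.mkdtemp()
w2v_path = os.path.join(workdir, "toy_word2vec.txt")
with open(w2v_path, "w", encoding="utf-8") as f:
    f.write("2 3\ncomputer 0.1 0.2 0.3\nhuman 0.4 0.5 0.6\n")

tensor_path, metadata_path = word2vec_text_to_tsv(
    w2v_path, os.path.join(workdir, "my_model")
)
```

The two resulting files can then be loaded into the Embedding Projector, vectors first and metadata second.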
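There are two formats for storing word2vec models in gensim: the keyed vectors format from the original word2vec implementation, and a format that additionally stores hidden weights, vocabulary frequencies, and more. Examples and details can be found in the documentation. The script word2vec2tensor.py uses the original format and loads the model with load_word2vec_format, as its code shows.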
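The format difference also explains the exact error in the traceback. As far as I know, `model.save` pickles the model object, and a pickle written with protocol 2 or later begins with the byte 0x80, which is not a valid first byte in UTF-8. The original text format instead begins with a readable header line. The sketch below shows this with the standard library only. The toy `vectors` dict and the helpers `to_word2vec_text` and `can_decode_header` are hypothetical stand-ins, not gensim APIs.

```python
import pickle

# Toy stand-in for trained vectors (made-up values, not from a real model).
vectors = {"computer": [0.1, 0.2, 0.3], "human": [0.4, 0.5, 0.6]}


def to_word2vec_text(vectors):
    """Serialize vectors in the plain-text layout of the original word2vec tool:
    a header "<vocab_size> <vector_size>", then one "<word> <v1> <v2> ..." line per word.
    """
    dim = len(next(iter(vectors.values())))
    lines = [f"{len(vectors)} {dim}"]
    for word, vec in vectors.items():
        lines.append(word + " " + " ".join(f"{x:.6f}" for x in vec))
    return ("\n".join(lines) + "\n").encode("utf-8")


def can_decode_header(data):
    """Mimic the failing step: read the first line and decode it as UTF-8."""
    try:
        data.split(b"\n", 1)[0].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


pickled = pickle.dumps(vectors, protocol=2)  # binary, like a pickled model file
text = to_word2vec_text(vectors)             # text, like save_word2vec_format output

first_pickle_byte = pickled[:1]
header = text.split(b"\n", 1)[0].decode("utf-8")

print(first_pickle_byte)           # the pickle protocol marker
print(header)                      # vocabulary size and vector size
print(can_decode_header(text))     # the text format has a decodable header
print(can_decode_header(pickled))  # the pickle does not
```

So the script is not broken. It was handed a pickled model where it expected the text (or binary) word2vec format.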