'utf-8' decode error when loading a word2vec module

Question

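I have to use a word2vec model that contains a large number of Chinese characters. The model was trained by my coworkers using Java and saved as a bin file.

I installed gensim and tried to load the model, but got the following error: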

In [1]: import gensim  

In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data

I tried loading the model in both Python 2.7 and 3.5, and it failed the same way in each. How can I load the model in gensim? Thanks.

Answer 1

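The model was trained in Java on text with a large number of Chinese characters, and I could not work out the encoding of the original corpus. The error can be resolved as described in the gensim FAQ:

Use load_word2vec_format with the flag that ignores character decoding errors: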

In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load_word2vec_format('/data5/momo-projects/user_interest_classification/code/word2vec/vectors_groups_1105.bin', binary=True, unicode_errors='ignore')

However, I do not know whether ignoring the encoding errors has any consequences.
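
On the question of whether ignoring decode errors matters, here is a minimal sketch using only Python's built-in bytes.decode. It assumes that gensim hands the unicode_errors value to the decoder as the errors argument for each vocabulary word; the sample word and the way it is truncated are invented for illustration and are not taken from the model in the question.

```python
# Illustration only: how the standard error handlers treat a UTF-8 word
# whose last character has been cut short (hypothetical sample data).
word = "中文"
raw = word.encode("utf-8")       # 6 bytes, 3 per character
truncated = raw[:-1]             # drop the final byte of the second character

try:
    truncated.decode("utf-8")    # errors="strict" is the default
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason    # human-readable cause of the failure

# "ignore" silently discards the bytes that cannot be decoded.
ignored = truncated.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD so the damage stays visible.
replaced = truncated.decode("utf-8", errors="replace")

print(strict_error)
print(repr(ignored))
print(repr(replaced))
```

Under that assumption, 'ignore' does not touch the vectors, but a damaged word is stored under a shorter, different key, so it may no longer match the word as it appears in the corpus, and two distinct words could end up with the same key. If that is a concern, 'replace' at least leaves a visible marker in the affected keys.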

Answer 2

I tried the flag

unicode_errors='ignore'

but it did not solve the unicode problem.

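I found that the unicode error appeared after I had renamed the file from filename.bin.gz to filename.gz.

My solution was to extract the compressed file instead of renaming it.

I then loaded the extracted file with the flag above and got no unicode error.

Note that I am using a Mac (Sierra) with Python 2.7.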
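
The extraction step can also be done from Python. The following is a rough sketch using only the standard library; the helper names is_gzip and gunzip and the file names in the usage comment are hypothetical, chosen for illustration. It checks for the gzip magic bytes rather than trusting the file extension, then decompresses to a plain bin file.

```python
import gzip
import shutil


def is_gzip(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


def gunzip(src, dst):
    """Decompress src into dst in chunks, so large files are not read whole."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


# Hypothetical usage:
# if is_gzip("filename.bin.gz"):
#     gunzip("filename.bin.gz", "filename.bin")
```

Checking the first two bytes shows whether a file is still compressed regardless of what it has been renamed to, which is the situation described above.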

python

nlp

gensim

word2vec