Facebook fastText .bin model raises UnicodeDecodeError

Question

I downloaded a pretrained word-vector file (.bin) from Facebook (https://fasttext.cc/docs/en/crawl-vectors.html), but loading the model fails with an error.

from gensim.models import FastText
fasttext_model = FastText.load_fasttext_format('cc.ko.300.bin', encoding='utf8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

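Strangely, the older .bin files (https://fasttext.cc/docs/en/pretrained-vectors.html) load without problems.

What is wrong with the new files, and how can I fix this?

I have to use the .bin file because I need all the n-grams to handle OOV words, so suggestions like 'use the .vec file' do not help.

Many thanks :)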

Answer 1

Make sure you are using the latest version of gensim (3.7.1); load_fasttext_format() has received recent fixes and improvements.

Also, double-check the cc.ko.300.bin file you downloaded to make sure it is not corrupted or truncated.
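A quick sanity check on the download could look like the sketch below. It assumes that a current fastText binary starts with two little-endian 32-bit integers, a magic number (793712314) and a format version (12), as defined in the fastText source at the time of writing; the helper name check_fasttext_header is made up for this example. A matching header only rules out a wrong or badly mangled file, so comparing the reported size with the size shown on the download page is still worthwhile.

```python
import os
import struct

# Assumed constants taken from the fastText source; verify them against your version.
FASTTEXT_MAGIC = 793712314
FASTTEXT_VERSION = 12


def check_fasttext_header(path):
    """Return (size_in_bytes, magic_ok, version) for a fastText .bin file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        # Too short to even contain the header: certainly truncated.
        return size, False, None
    magic, version = struct.unpack("<ii", header)
    return size, magic == FASTTEXT_MAGIC, version
```

Calling check_fasttext_header("cc.ko.300.bin") should report magic_ok as True for an intact file under these assumptions; very old model files may not carry this header at all.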

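If neither of these helps, enable logging at the INFO level, try the load again, and add the full output and error stack to your question; that will give more hints about where the problem lies.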
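A minimal sketch of that logging setup, using only the standard logging module plus the same gensim call as in the question. The wrapper name load_with_logging is hypothetical, and the call is specific to the gensim 3.x API used above; newer gensim releases are expected to offer load_facebook_model in gensim.models.fasttext instead.

```python
import logging

# Show INFO-level messages from gensim (and everything else) on the console.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def load_with_logging(path):
    """Load a Facebook fastText .bin model via gensim, logging any failure with its traceback."""
    # Imported here so that logging is already configured when gensim starts up.
    from gensim.models import FastText
    try:
        return FastText.load_fasttext_format(path, encoding="utf8")
    except Exception:
        # logging.exception records the message together with the full stack trace.
        logging.exception("Loading %s failed", path)
        raise
```

Running load_with_logging("cc.ko.300.bin") prints gensim's progress messages followed by the complete traceback, which is the output worth pasting into the question.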

Answer 2

It turns out that the Facebook Korean fastText model contains some odd Unicode; gensim will be updated to address this.

https://github.com/RaRe-Technologies/gensim/issues/2402
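To illustrate the kind of failure involved (this is not gensim's actual fix, and the real cause of the bad bytes is described in the issue above), the sketch below assumes for the sake of example that a Korean vocabulary entry was cut off in the middle of a character. Hangul syllables are three bytes long in UTF-8 and many start with the lead byte 0xed, the byte named in the error message. The helper names are made up for this example.

```python
word = "한국어"
raw = word.encode("utf-8")   # 9 bytes, 3 per Hangul syllable
truncated = raw[:-1]         # cut in the middle of the last character


def decode_strict(data):
    """Decode as UTF-8, returning None when the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_lenient(data):
    """Decode as UTF-8, silently dropping bytes that cannot be decoded."""
    return data.decode("utf-8", errors="ignore")


print(hex(raw[0]))                # 0xed, the lead byte of the first syllable
print(decode_strict(truncated))   # None
print(decode_lenient(truncated))  # 한국
```

A loader that decodes vocabulary entries strictly fails on such input, while one that passes an error handler such as errors="ignore" keeps going at the cost of slightly altered words.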

Answer 3

It is better to load fastText word embeddings with the fastText package itself rather than with gensim.

First install the fasttext module for Python with pip install fasttext.

Then use the following Python code:

import fasttext
model = fasttext.load_model("path/2/pretrained_fastText_word_embeddings.bin")
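Since the question is about OOV words, a short sketch of how the loaded model might be queried follows. It relies on get_word_vector and get_subwords from the official fasttext Python module; the helper names embed_words and main, the model path, and the sample words are assumptions made for illustration.

```python
def embed_words(model, words):
    """Map each word to its vector; fastText composes vectors for OOV words from character n-grams."""
    return {word: model.get_word_vector(word) for word in words}


def main(model_path="cc.ko.300.bin"):
    import fasttext  # pip install fasttext

    model = fasttext.load_model(model_path)

    # The second word is a made-up compound that is unlikely to be in the vocabulary.
    for word, vec in embed_words(model, ["한국어", "한국어처리기"]).items():
        print(word, vec.shape)

    # Inspect the subword units (and their indices) behind a word.
    subwords, indices = model.get_subwords("한국어처리기")
    print(subwords[:5])
```

Calling main() with the path to the downloaded .bin file prints one vector shape per word, including the out-of-vocabulary one, which is the behaviour the .vec files cannot provide.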
