Difference between Gensim's FastText and Facebook's FastText

Question

I realize there is the original implementation of FastText here, with which you can use fasttext.train_unsupervised to generate word vectors (see this link as an example). However, it turns out that gensim also supports fasttext, and its API is similar to that of word2vec. See the example here.

I was wondering whether there is a difference between the two implementations. The documentation is not clear, but do they both implement the paper Enriching Word Vectors with Subword Information? If so, why would one use gensim's fasttext rather than fasttext?

Answer 1

I found one difference in gensim's documentation:

word_ngrams (int, optional) – In Facebook’s FastText, “max length of word ngram” -
but gensim only supports the default of 1 (regular unigram word handling).

This means gensim handles only word unigrams; word bigrams and trigrams are not supported.
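To make the distinction concrete, here is a minimal sketch in plain Python of the two kinds of n-grams involved. It is only an illustration of the concepts, not code from either library: the helper names word_ngrams and char_ngrams are made up for this example, and the real implementations hash n-grams into a fixed number of buckets rather than keeping them as strings. The character n-gram defaults of 3 and 6 are assumed here to mirror the usual minimum and maximum subword lengths.

```python
def word_ngrams(tokens, max_n):
    """Return all word n-grams of length 1..max_n (hypothetical helper)."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of a word wrapped in '<' and '>' boundary marks."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


tokens = ["new", "york", "city"]

# Word unigrams only: the behaviour described for word_ngrams=1.
print(word_ngrams(tokens, 1))

# Word unigrams plus bigrams: what a max word n-gram length of 2 would add.
print(word_ngrams(tokens, 2))

# Character trigrams of "where", the subword units described in the paper.
print(char_ngrams("where", min_n=3, max_n=3))
```

The character n-grams (the subword information from the paper) are a separate mechanism from word n-grams, so the limitation quoted above concerns only the latter.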

Answer 2

Gensim intends to match Facebook's implementation, but with a few known or intentional differences. Specifically, Gensim does not implement:

the -supervised option, and the autotuning/quantization/pretrained-vectors options specific to that mode
word multigrams (as controlled by fasttext's -wordNgrams parameter)
the plain softmax option for loss optimization

Regarding the options for -loss, I'm fairly sure that, despite Facebook's command-line options docs indicating that the fasttext default is softmax, it is actually ns except when in -supervised mode, just like word2vec.c and Gensim. See for example this source code.

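I suspect future contributions to Gensim will add wordNgrams support, if that mode proves useful to some users and in order to match the reference implementation.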

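So far, Gensim's choice has been to avoid any supervised algorithms, so the -supervised mode is unlikely to appear in any future Gensim. (I'd be in favor of it, though, if a working implementation were contributed.)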

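The plain softmax mode is so much slower on typical large output vocabularies that few non-academic projects would want to use it over hs or ns. (In -supervised mode, though, it may still be practical, since the number of output labels is smaller.)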
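As a rough illustration of why plain softmax is slower, the sketch below compares how many output scores each loss has to touch for a single training example. It is a toy in plain Python, not code from fastText or Gensim: the function names are made up, the scores are random numbers standing in for dot products, and the negatives are drawn uniformly for simplicity, which is an assumption rather than how the real samplers pick them.

```python
import math
import random


def full_softmax_loss(scores, target):
    """Cross-entropy over every output word; touches all len(scores) scores."""
    max_s = max(scores)
    log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
    return log_z - scores[target], len(scores)


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_loss(scores, target, negatives):
    """Binary logistic loss on the target plus a few sampled negatives."""
    loss = -log_sigmoid(scores[target])
    for j in negatives:
        loss -= log_sigmoid(-scores[j])
    return loss, 1 + len(negatives)


random.seed(0)
vocab_size = 100_000  # stand-in for a large output vocabulary
scores = [random.uniform(-1.0, 1.0) for _ in range(vocab_size)]
target = 42
# Uniform sampling of 5 negatives: a simplification for this sketch.
negatives = random.sample([i for i in range(vocab_size) if i != target], 5)

softmax_loss, softmax_touched = full_softmax_loss(scores, target)
ns_loss, ns_touched = negative_sampling_loss(scores, target, negatives)

# Number of output scores each loss needed for one training example.
print(softmax_touched, ns_touched)
```

With a vocabulary of 100,000 words, the softmax needs all 100,000 scores per example, while negative sampling with 5 negatives needs 6. With only a handful of output labels, as in a typical classification task, that gap mostly disappears.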
