How to store the gensim Phrases trigram model after training

I would like to know whether a gensim Phrases model can be stored after it has been trained on sentences.

from gensim.models import Phrases

documents = [
    "the mayor of new york was there",
    "human computer interaction and machine learning has now become a trending research area",
    "human computer interaction is interesting",
    "human computer interaction is a pretty interesting subject",
    "human computer interaction is a great and new subject",
    "machine learning can be useful sometimes",
    "new york mayor was present",
    "I love machine learning because it is a new subject area",
    "human computer interaction helps people to get user friendly applications",
]

sentences = [doc.split(" ") for doc in documents]

bigram_transformer = Phrases(sentences)
bigram_sentences = bigram_transformer[sentences]
print("Bigrams - done")
# Here we use a phrase model that detects the collocation of 3 words (trigrams).
trigram_transformer = Phrases(bigram_sentences)
trigram_sentences = trigram_transformer[bigram_sentences]
print("Trigrams - done")

How can I store trigram_transformer on disk, for example with pickle, so that I can use it again later?

Thanks in advance for your help.

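Convert the list (or whatever format you have) to a numpy array and save it as a .npy file. It is easy to save and read back, and numpy lets you load it on almost every platform, such as Google Colab or Replit. See the documentation for numpy.save() for more details on saving .npy files.

Using pickle is also a good option, but things can get a bit tricky when the encoding differs between environments.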
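As a rough illustration of the numpy suggestion, the sketch below saves a list of document strings to a .npy file and reads it back. The file name and the shortened document list are assumptions made for the example. Note that this stores array-like data such as the raw documents, not the trained Phrases object itself.

```python
import os
import tempfile

import numpy as np

# Shortened stand-in for the question's documents list (assumption for this sketch).
documents = [
    "the mayor of new york was there",
    "new york mayor was present",
]

# Hypothetical output location; any writable path would do.
npy_path = os.path.join(tempfile.mkdtemp(), "documents.npy")

# A list of plain strings becomes a fixed-width unicode array,
# so no pickling is involved in saving or loading it.
np.save(npy_path, np.array(documents))

# Load the array back and convert it to a regular Python list of str.
restored_documents = np.load(npy_path).tolist()
print(restored_documents == documents)
```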

You can use Gensim's native .save() method:

trigram_transformer.save(TRIPHRASER_PATH)

...then reload it in a similar way:

reloads_trigram_transformer = Phrases.load(TRIPHRASER_PATH)

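(Gensim's save/load methods generally use Python pickling, but may handle certain attributes specially for some models and version transitions.)

You can also use Python's own pickle, which should work fine unless/until you try to load a too-old model into a newer version of Gensim that has changed something about the Phrases model.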