How to run word2vec on Windows using gensim

Question

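A few years ago, a former developer on my team wrote the following Python code, which calls word2vec and passes in the locations of a training file and an output file. He worked on Linux. I've been asked to get this running on a Windows machine. Bear in mind that I know next to no Python. I've installed gensim, which I gather now implements word2vec, but I don't know how to rewrite the code to use the library instead of the executable, which doesn't seem to compile on a Windows box. Can someone help me update this code?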

#!/usr/bin/env python3

import os
import csv
import subprocess
import shutil

from gensim.models import word2vec

def train_word2vec(trainFile, output):
    # run word2vec:
    subprocess.run(["word2vec", "-train", trainFile, "-output", output,
                    "-cbow", "0", "-window", "10", "-size", "100"],
                   shell=False)
    # Remove some invalid unicode:
    with open(output, 'rb') as input_,\
         open('%s.new' % output, 'w') as new_output:
        for line in input_:
            try:
                print(line.decode('utf-8'), file=new_output, end='')
            except UnicodeDecodeError:
                print(line)
                pass
    shutil.move('%s.new' % output, output)

def main():
    train_word2vec("c:/temp/wc/test1_BigF.txt", "c:/temp/wc/test1_w2v_model.txt")

if __name__ == '__main__':
    main()

Answer 1

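First, either you've posted incomplete code or your script is missing the following part, which lets it take its arguments from the command line (add it to the bottom of the script):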

if __name__ == '__main__':
    import sys
    train_word2vec(sys.argv[1], sys.argv[2])

Then run the script from the command line (Python is interpreted, not compiled), roughly like this:

python.exe your_script_file.py pathToInput pathToOutput
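
If you'd rather not index sys.argv directly, a small argparse wrapper gives you the same two positional arguments plus a usage message when one is missing. This is only a sketch; the names parse_args, train_file and output are mine, not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths the training script needs.

    Passing argv=None makes argparse read sys.argv[1:], which is the
    normal behaviour when the script is started from the command line.
    """
    parser = argparse.ArgumentParser(
        description="Train word2vec vectors from a text file."
    )
    parser.add_argument("train_file", help="path to the training text")
    parser.add_argument("output", help="path the vectors are written to")
    return parser.parse_args(argv)
```

At the bottom of the script you would then call args = parse_args() followed by train_word2vec(args.train_file, args.output).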

Answer 2

I think the core of what you're after is this:

import sys

from gensim.models.word2vec import Word2Vec

def train_word2vec(trainFile, output):
    # compile word arrays for each sentence of input vocab
    with open(trainFile) as f:
        sentences = [line.split() for line in f]

    # effective executable invocation of original code (included for reference)
    # word2vec -train {trainFile} -output {output} -cbow 0 -window 10 -size 100

    # invocation via word2vec module with (mostly) equivalent params
    # (gensim 4.0 and later call the `size` parameter `vector_size`)
    model = Word2Vec(sentences, size=100, window=10, min_count=1, workers=4)

    # save generated model
    model.save(output)

if __name__ == '__main__':
    train_word2vec(sys.argv[1], sys.argv[2])

Save it as train.py and invoke it as follows:

python train.py input.txt output.txt

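A few things to note:

The module name (word2vec) and the name of the imported class (Word2Vec) differ only in case. If you mix them up, the script will crash.
I haven't found/included an equivalent of the command-line -cbow 0 argument. My guess is that it means the Skip-gram algorithm is preferred over CBOW, but it needs someone with more gensim experience than me to advise on the consequences of that, or indeed of leaving it out.
I also haven't included (or tried to replicate) the original Unicode removal logic. The generated model output is mostly binary data, so taken 'as is' that logic (a) crashes almost immediately and (b) leaves me none the wiser as to what it was trying to achieve.

Hope this helps.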
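
For what it's worth, here is a minimal sketch of how the two gaps noted above might be closed, assuming gensim 4.0 or later. As far as I know, sg=1 selects skip-gram in gensim, which is what -cbow 0 does in the C tool, and save_word2vec_format(..., binary=False) writes plain-text vectors rather than gensim's own save format. The LineSentences helper is a hypothetical name of mine. It drops undecodable bytes while reading, which is my assumption about what the old clean-up step was for, and it assumes the training file is meant to be UTF-8.

```python
class LineSentences:
    """Restartable iterable yielding one list of tokens per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # errors="ignore" silently drops bytes that are not valid UTF-8
        # instead of raising UnicodeDecodeError.
        with open(self.path, encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


def train_word2vec(train_file, output):
    # Imported here so the helper above stays usable without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=LineSentences(train_file),
        vector_size=100,  # named `size` in gensim releases before 4.0
        window=10,
        sg=1,             # 1 = skip-gram, the counterpart of `-cbow 0`
        min_count=1,      # kept from the code above
        workers=4,
    )
    # Plain-text vectors: a header line, then one word and its vector per line.
    model.wv.save_word2vec_format(output, binary=False)
    return model
```

gensim makes more than one pass over the sentences, which is why the helper is a class with __iter__ rather than a one-shot generator.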
