Training a word2vec model: streaming data from files and tokenizing into sentences

I need to process a large number of txt files to build a word2vec model. Right now my txt files are a bit messy: I need to strip all `\n` newline characters, read all sentences out of the loaded string (the txt file), and then tokenize each sentence so it can be used by the word2vec model.

The problem is that I can't read the files line by line, because some sentences don't end at the end of a line. Therefore I use the `tokenize()` method of nltk's punkt sentence tokenizer, which splits the file into sentences.
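
For reference, a minimal sketch of what that sentence splitting looks like (the sample string is made up for illustration; the punkt data must already be downloaded via `nltk.download('punkt')`):

import nltk

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = "some messy text\nhere. And another\nsentence"
# join the lines first, then let punkt decide where the sentence boundaries are
print(tokenizer.tokenize(text.replace('\n', ' ')))
# ['some messy text here.', 'And another sentence']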

I can't figure out how to convert a list of strings into a list of lists, where each sub-list contains one sentence, while passing it through a generator.

Or do I really need to save every sentence into a new file (one sentence per line) in order to pass it through a generator?

OK, my code looks like this:

import nltk

# initialize the punkt tokenizer for splitting text into sentences
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

class Raw_Sentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for file in file_loads:  # Note: file_loads includes directory name of files (e.g. 'C:/Users/text-file1.txt')
            with open(file, 'r', encoding='utf-8') as t:
                # temporarily store the list of sentences produced by the punkt tokenizer
                storage = tokenizer.tokenize(t.read().replace('\n', ' '))
                for sentence in storage:
                    print(nltk.word_tokenize(sentence))
                    yield nltk.word_tokenize(sentence)
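
A side note: `file_loads` is never defined inside the class. One possible way to build it (this is only an assumption about how the file list could be constructed, not part of the original code) would be:

import glob
import os

# hypothetical construction of file_loads: collect all .txt files in a directory
file_loads = glob.glob(os.path.join('C:/Users', '*.txt'))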

So the goal is: load file 1: `'some messy text here. And another sentence'`, tokenize it into sentences `['some messy text here', 'And another sentence']`, and then split each sentence into words `[['some','messy','text','here'], ['And','another','sentence']]`.

Load file 2: `'some other messy text. sentence1. sentence2.'`, and so on.

And feed the sentences into the word2vec model: `sentences = Raw_Sentences(directory)`

`model = gensim.models.Word2Vec(sentences)`

OK... after writing all of this down and thinking it over again... I think I have solved my own problem. Please correct me if I'm wrong:

To iterate over each sentence created by the nltk punkt sentence tokenizer, the tokenizer output has to be passed directly to a for loop:

def __iter__(self):
    for file in file_loads:
        with open(file, 'r', encoding='utf-8') as t:
            for sentence in tokenizer.tokenize(t.read().replace('\n', ' ')):
                yield nltk.word_tokenize(sentence)

As always, there is also the alternative of `yield gensim.utils.simple_preprocess(sentence, deacc=True)`.
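
For comparison, `simple_preprocess` lowercases the text, strips punctuation and, with `deacc=True`, removes accents; the example sentence here is just an illustration:

from gensim.utils import simple_preprocess

# lowercased, punctuation stripped, accents removed with deacc=True
print(simple_preprocess("And another sentence.", deacc=True))
# ['and', 'another', 'sentence']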

Feeding that into `sentences = Raw_Sentences(directory)` builds a working Word2Vec model: `gensim.models.Word2Vec(sentences)`.
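
Put together, a minimal end-to-end sketch (the directory path and the queried word are assumptions for illustration):

import gensim

sentences = Raw_Sentences('C:/Users')      # assumed directory containing the .txt files
# Word2Vec iterates over the corpus several times (vocabulary pass plus training epochs),
# which is why a class with __iter__ works where a one-shot generator would be exhausted
model = gensim.models.Word2Vec(sentences)
# after training, vectors live in model.wv; the query word must have met the min_count threshold
print(model.wv.most_similar('text'))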