Training a word2vec model: streaming data from files and tokenizing into sentences
I need to process a large number of txt files to build a word2vec model.
Now, my txt files are a bit messy: I need to strip all `'\n'` newline characters, read all sentences out of the loaded string (the txt file), and then tokenize each sentence for the word2vec model.
The problem is that I can't read the files line by line, because some sentences don't end after a single line. So I use the punkt tokenizer's `tokenize()` method, which splits the file into sentences.
I can't figure out how to convert a list of strings into a list of lists, where each sub-list contains one tokenized sentence, while passing it through a generator.
Or do I really need to save each sentence to a new file (one sentence per line) in order to pass it through a generator?
Well, my code looks like this:
```python
import nltk

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# initialize the punkt tokenizer for splitting text into sentences

class Raw_Sentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for file in file_loads:  # Note: file_loads holds the file paths (e.g. 'C:/Users/text-file1.txt')
            with open(file, 'r', encoding='utf-8') as t:
                # print(tokenizer.tokenize(t.read().replace('\n', ' ')))
                storage = tokenizer.tokenize(t.read().replace('\n', ' '))
                # I tried to temporarily store the list of sentences for the iteration
                for sentence in storage:
                    print(nltk.word_tokenize(sentence))
                    yield nltk.word_tokenize(sentence)
```
So the goal is:
Load file 1: `'some messy text here. And another sentence'`
Tokenize it into sentences: `['some messy text here', 'And another sentence']`
Then split each sentence into words: `[['some','messy','text','here'], ['And','another','sentence']]`
Load file 2: `'some other messy text. sentence1. sentence2.'`
And so on.
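The intended transformation can be sketched with the standard library alone; here a crude regex split on periods stands in for the punkt sentence tokenizer (which handles abbreviations and other edge cases far better), and `str.split()` stands in for `nltk.word_tokenize`:

```python
import re

def sentences_to_word_lists(text):
    """Split raw text into sentences, then each sentence into words.

    The regex split on '.' is only a stand-in for the punkt sentence
    tokenizer; real-world text needs nltk to handle abbreviations etc.
    """
    flat = text.replace('\n', ' ')  # drop line breaks first
    sents = [s.strip() for s in re.split(r'\.\s+|\.$', flat) if s.strip()]
    return [s.split() for s in sents]  # one word list per sentence

print(sentences_to_word_lists('some messy text here. And another sentence'))
# [['some', 'messy', 'text', 'here'], ['And', 'another', 'sentence']]
```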
And feed the sentences into the word2vec model:

```python
sentences = Raw_Sentences(directory)
model = gensim.models.Word2Vec(sentences)
```
Okay... after writing everything down and thinking it over again... I believe I've answered my own question. Please correct me if I'm wrong:
To iterate over each sentence produced by the nltk punkt sentence tokenizer, the tokenizer output has to be passed directly to the for loop:
```python
def __iter__(self):
    for file in file_loads:
        with open(file, 'r') as t:
            for sentence in tokenizer.tokenize(t.read().replace('\n', ' ')):
                yield nltk.word_tokenize(sentence)
```
As always, there is also the alternative of `yield gensim.utils.simple_preprocess(sentence, deacc=True)`.
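For intuition about what that alternative does, here is a rough stdlib-only mimic of `gensim.utils.simple_preprocess` (the real function uses gensim's own tokenizer; the length bounds and the accent handling below are approximations of its defaults):

```python
import re
import unicodedata

def simple_preprocess_like(text, deacc=False, min_len=2, max_len=15):
    """Rough mimic of gensim.utils.simple_preprocess: lowercase,
    keep alphabetic tokens, filter by length, and optionally
    strip accents (deacc=True)."""
    if deacc:
        # decompose accented characters and drop the combining marks
        text = ''.join(c for c in unicodedata.normalize('NFKD', text)
                       if not unicodedata.combining(c))
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(simple_preprocess_like('Héllo, Wörld! A sentence.', deacc=True))
# ['hello', 'world', 'sentence']
```

Note that, unlike `nltk.word_tokenize`, this style of preprocessing discards punctuation tokens and very short words entirely, which is usually what you want for word2vec input.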
Feeding it in via `sentences = Raw_Sentences(directory)` and then `gensim.models.Word2Vec(sentences)` builds a properly working Word2Vec model.
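One design note on why the class-based `__iter__` approach works here, sketched below with a throwaway temp file and a naive period split standing in for the punkt tokenizer: `Word2Vec` iterates over the corpus more than once (once to build the vocabulary, then once per training epoch). A plain generator would be exhausted after the first pass, while a class restarts from the top every time `__iter__` is called:

```python
import os
import re
import tempfile

class RawSentences(object):
    """Restartable corpus: each call to __iter__ re-reads the files,
    so consumers like Word2Vec can make multiple passes over the data."""
    def __init__(self, file_paths):
        self.file_paths = file_paths

    def __iter__(self):
        for path in self.file_paths:
            with open(path, 'r', encoding='utf-8') as t:
                flat = t.read().replace('\n', ' ')
                # naive sentence split; use the punkt tokenizer for real text
                for sentence in re.split(r'\.\s*', flat):
                    if sentence.strip():
                        yield sentence.split()

# demo with a throwaway file
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                  encoding='utf-8')
tmp.write('some messy text\nhere. And another sentence')
tmp.close()

corpus = RawSentences([tmp.name])
first_pass = list(corpus)
second_pass = list(corpus)  # a bare generator would already be empty here
print(first_pass == second_pass, first_pass)
os.unlink(tmp.name)
```

This is exactly why saving sentences to intermediate files is unnecessary: the restartable iterable re-derives them from the raw files on every pass.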