Batch-train word2vec in gensim with support of multiple workers
Context

There are several questions about how to train Word2Vec with gensim and streamed data. However, these questions do not address the problem that streaming cannot use multiple workers, since there is no array that could be split between threads.

Hence I wanted to create a generator providing such functionality for gensim. My results so far:
from gensim.models import Word2Vec as w2v
import numpy as np
import threading

# The data is stored in a python list, unsplit.
# It is too much data to store it split, so I have to do the split while streaming.
data = ['this is document one', 'this is document two', ...]

# Now the generator class
class dataGenerator:
    """
    Generator for batch-tokenization.
    """

    def __init__(self, data: list, batch_size: int = 40):
        """Initialize generator and pass data."""
        self.data = data
        self.batch_size = batch_size
        self.lock = threading.Lock()

    def __len__(self):
        """Get total number of batches."""
        return int(np.ceil(len(self.data) / float(self.batch_size)))

    def __iter__(self) -> list:
        """
        Iterator wrapper for generator functionality (since generators cannot be used directly).
        Allows for data streaming.
        """
        for idx in range(len(self)):
            yield self[idx]

    def __getitem__(self, idx):
        # Make multithreaded access thread-safe
        with self.lock:
            # Return the current batch by slicing the data.
            return [arr.split(" ") for arr in self.data[idx * self.batch_size : (idx + 1) * self.batch_size]]

# And now do the training
model = w2v(
    sentences=dataGenerator(data),
    size=300,
    window=5,
    min_count=1,
    workers=4
)
This results in the error

TypeError: unhashable type: 'list'

caused by dataGenerator(data). Since it works if I yield just a single split document, I assume that gensim's word2vec wraps the generator in an extra list. In that case __iter__ would look like:
def __iter__(self) -> list:
    """
    Iterator wrapper for generator functionality (since generators cannot be used directly).
    Allows for data streaming.
    """
    for text in self.data:
        yield text.split(" ")
Accordingly, my batches would also get wrapped, resulting in something like [[['this', '...'], ['this', '...']], [[...], [...]]] (=> a list of lists of lists), which gensim cannot process.
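To make the shape mismatch concrete, here is a minimal sketch (illustrative values only) of what gensim expects per yielded item versus what the batch generator above produces:

# What Word2Vec expects from the sentences iterable:
# each item is ONE tokenized sentence (a list of strings).
expected_item = ['this', 'is', 'document', 'one']

# What the batch generator above yields instead:
# each item is a whole BATCH, i.e. a list of tokenized sentences.
yielded_item = [['this', 'is', 'document', 'one'],
                ['this', 'is', 'document', 'two']]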
My questions:

Can I "stream"-pass batches in order to use multiple workers? How do I have to change my code accordingly?
It seems like I was just too impatient. I ran the streaming function written above, which processes only one document instead of a batch:
def __iter__(self) -> list:
    """
    Iterator wrapper for generator functionality (since generators cannot be used directly).
    Allows for data streaming.
    """
    for text in self.data:
        yield text.split(" ")
After starting the w2v function, it took about ten minutes until all cores were working correctly. It seems that building the vocabulary does not support multiple cores, so only one was used for this task; presumably it took so long because of the corpus size. After gensim had built the vocab, all cores were used for the training.

So if you run into this issue as well, maybe some patience already helps :)
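If you want to see the two phases explicitly, vocabulary building and training can also be run as separate calls. A minimal sketch, assuming the corrected generator above (one tokenized document per item) and pre-4.0 gensim parameter names (size was renamed to vector_size in gensim 4.x):

from gensim.models import Word2Vec

sentences = dataGenerator(data)  # yields one tokenized document at a time

# Phase 1: vocabulary building. This scan runs single-threaded, which is
# why only one core is busy at first on a large corpus.
model = Word2Vec(size=300, window=5, min_count=1, workers=4)
model.build_vocab(sentences)

# Phase 2: training. This is where all `workers` threads kick in.
model.train(sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)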
Just want to reiterate that this is the way to go: with a large corpus and multiple CPUs, it's much faster to train gensim word2vec using the corpus_file parameter instead of sentences, as mentioned in the docs:

- corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (or none of them, in that case, the model is left uninitialized).

The LineSentence format is basically just one sentence per line, with the words space-separated. Plain text, .bz2 or gz.
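A minimal sketch of that route, assuming pre-4.0 parameter names and a made-up file name: write the corpus out once in LineSentence format, then point corpus_file at it:

from gensim.models import Word2Vec

# Write the corpus in LineSentence format: one sentence per line,
# tokens separated by spaces ("corpus.txt" is a hypothetical name).
with open("corpus.txt", "w", encoding="utf-8") as f:
    for doc in data:
        f.write(doc + "\n")

# With corpus_file, gensim's workers can read the file in parallel
# chunks instead of being fed by a single Python iterator.
model = Word2Vec(
    corpus_file="corpus.txt",
    size=300,
    window=5,
    min_count=1,
    workers=4
)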