Add progress bar (verbose) when creating gensim dictionary

Question

I want to create a gensim dictionary from the rows of a dataframe. Each entry of df.preprocessed_text is a list of words.

from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary


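def create_dict(df, bigram=True, min_occ_token=3):
    # One list of tokens per dataframe row.
    token_ = df.preprocessed_text.values
    if not bigram:
        return Dictionary(token_)

    # Learn bigram statistics from the whole corpus.
    bigram = Phrases(token_,
                     min_count=3,
                     threshold=1,
                     delimiter=b' ')

    # Lighter-weight object for applying the learned bigrams.
    bigram_phraser = Phraser(bigram)

    # Rewrite every sentence with detected bigrams merged into single tokens.
    bigram_token = []
    for sent in token_:
        bigram_token.append(bigram_phraser[sent])

    # Build the dictionary, then drop very rare and very common tokens.
    dictionary = Dictionary(bigram_token)
    dictionary.filter_extremes(no_above=0.8, no_below=min_occ_token)
    dictionary.compactify()

    return dictionary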

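I couldn't find a progress bar option for this, and callbacks don't seem to work for it either. Since my corpus is huge, I'd really appreciate a way to show progress. Is there one?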

Answer 1

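I'd recommend against changing prune_at for monitoring purposes: it changes which bigrams/words are remembered, and may discard more of them than is strictly required to cap memory usage.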

Wrapping tqdm around the iterables you pass in (both the use of token_ in the Phrases constructor and the use of bigram_token in the Dictionary constructor) should work.

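Alternatively, enabling logging at INFO level or more verbose should show log output that, while not as pretty or accurate as a progress bar, gives some indication of progress.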
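A minimal way to switch that logging on uses only the standard library. The format string here is just one common choice, and the exact messages and how often they appear depend on the gensim version, so treat this as a sketch rather than a guaranteed output format.

```python
import logging

# Send log records to the console with a timestamp and level.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# Make sure gensim's own loggers emit INFO messages even if the root
# logger was already configured elsewhere with a stricter level.
logging.getLogger("gensim").setLevel(logging.INFO)
```

Run this before constructing Phrases or Dictionary. On a large corpus, both should then periodically log how many sentences or documents they have processed.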

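Further, if, as the code shows, bigram_token is only used to feed the next Dictionary, it does not need to be created as a full in-memory list. You should be able to use layered iterators to transform the text and tally the Dictionary item by item. For example: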

    # ...
    dictionary = Dictionary(tqdm(bigram_phraser[token_]))
    # ...

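(Also, if you only use the Phraser once, you may not get any benefit from creating it at all. It's an optional memory optimization for when you want to keep applying the same phrase-creation operation without the full overhead of the original Phrases survey object. But if the Phrases is still in scope, and all of it will be discarded right after this step, it might be just as fast to use the Phrases object directly, without the detour of creating the Phraser, so give that a try.)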
