Calculating similarity between a TF-IDF matrix and a predicted vector causes memory overflow

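I have built a tf-idf model on roughly 20,000,000 documents with the code below, and it works well. The problem is that memory usage spikes when I try to compute similarity scores with linear_kernel: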

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

train_file = "docs.txt"
train_docs = DocReader(train_file) #DocReader is a generator for individual documents

vectorizer = TfidfVectorizer(stop_words='english',max_df=0.2,min_df=5)
X = vectorizer.fit_transform(train_docs)

#predicting a new vector, this works well when I check the predictions
indoc = "This is an example of a new doc to be predicted"
invec = vectorizer.transform([indoc])

#This is where the memory blows up
similarities = linear_kernel(invec, X).flatten()

It seems like this shouldn't take much memory: comparing a 1-row CSR matrix against a 20-million-row CSR matrix should output a 1 x 20,000,000 ndarray.
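As a rough check of that expectation, the size of the dense result can be estimated directly. This is only a back-of-the-envelope sketch; it assumes the scores come back as float64 (the default dtype of TfidfVectorizer output), and the variable names are purely illustrative.

```python
import numpy as np

n_docs = 20_000_000                               # number of rows in X
bytes_per_score = np.dtype(np.float64).itemsize   # 8 bytes, assuming float64 scores

# Size of a 1 x n_docs dense result, in megabytes
result_mb = n_docs * bytes_per_score / 1e6
print(f"expected result size: {result_mb:.0f} MB")
```

So the output array itself should be small next to the 12 GB matrix, which suggests the extra memory is being used somewhere during the computation rather than by the result.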

Just FYI: X is a CSR matrix that takes about 12 GB in memory (my computer only has 16 GB). I looked into gensim as a replacement, but I couldn't find a good example.

Any thoughts on what I'm missing?

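You can do the processing in batches. Here is an example based on your code snippet, with the dataset swapped for one that ships with sklearn. For this smaller dataset I also compute the similarities the original way, to show that the results are equivalent. You can probably use a larger batch size.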

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.datasets import fetch_20newsgroups

train_docs = fetch_20newsgroups(subset='train')

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.2,min_df=5)
X = vectorizer.fit_transform(train_docs.data)

#predicting a new vector, this works well when I check the predictions
indoc = "This is an example of a new doc to be predicted"
invec = vectorizer.transform([indoc])

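# Compute the similarities in batches so only one slice of X is processed at a time
batchsize = 1024
similarities = []
for i in range(0, X.shape[0], batchsize):
    # Slicing past the end of X is safe, so no explicit upper bound is needed
    similarities.extend(linear_kernel(invec, X[i:i + batchsize]).flatten())
similarities = np.array(similarities)

# Compute everything in a single call for comparison (feasible for this small dataset)
similarities_orig = linear_kernel(invec, X).flatten()
print((similarities == similarities_orig).all())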

Output:

True
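
If you want to reuse the batching, it could be wrapped in a small helper along these lines. This is only a sketch: the function name batched_linear_kernel, the default batch size, and the random sparse matrix standing in for the real tf-idf data are all assumptions made for illustration. It writes into a preallocated array instead of growing a Python list, which should keep overhead lower with 20 million rows.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import linear_kernel


def batched_linear_kernel(query, X, batch_size=100_000):
    """Score one sparse query row against every row of X, one slice at a time.

    Hypothetical helper, not part of scikit-learn.
    """
    n_rows = X.shape[0]
    scores = np.empty(n_rows, dtype=np.float64)  # preallocated 1-D result
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # Only this slice of X is passed to linear_kernel
        scores[start:stop] = linear_kernel(query, X[start:stop]).ravel()
    return scores


# Synthetic stand-in for the tf-idf matrix and the query vector
rng = np.random.default_rng(0)
dense = rng.random((5_000, 300))
dense[dense < 0.99] = 0.0                  # keep roughly 1% of the entries
X_demo = sparse.csr_matrix(dense)
query_demo = sparse.csr_matrix(rng.random((1, 300)))

batched = batched_linear_kernel(query_demo, X_demo, batch_size=1_024)
direct = linear_kernel(query_demo, X_demo).ravel()
print(np.allclose(batched, direct))
```

With the real data you would pass invec and X instead of the demo matrices, and tune batch_size to whatever fits comfortably in the memory you have left.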