CountVectorizer running out of memory when converting from sparse to dense

Question

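I am trying to run this code to compute the relative frequency of each word across a large set of documents (more than 40,000). I cannot reduce the vocabulary size, and the code throws a memory error when running on Colab with 12 GB of RAM. How can I refactor the code so that I don't have to call X.toarray() to convert from sparse to dense, which is what triggers the out-of-memory error (120,000 words * 40,000 documents)?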

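import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# docs (list of strings) and word_to_index (dict) are defined elsewhere
vect = CountVectorizer(vocabulary=list(word_to_index.keys()), tokenizer=lambda x: x.split())
X = vect.fit_transform(docs)
X_arr = X.toarray()  # the memory error is raised here
rel_freq = np.sum(X_arr, axis=0) / len(docs)
names = vect.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0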

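In case you are wondering why I need this: I am implementing the ConWea code (https://github.com/dheeraj7596/ConWea) on a dataset that is larger than the authors'. Thank you all very much.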

Answer 1

If you only need the frequencies, you can call the sum method on the sparse matrix directly:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

corpus = ['This is the first document.','This is the second second document.',
'And the third one.','Is this the first document?']

X = vectorizer.fit_transform(corpus)

# Sum over documents on the sparse matrix itself; no dense copy is created
X.sum(axis=0)/len(corpus)
# matrix([[0.25, 0.75, 0.5 , 0.75, 0.25, 0.5 , 1.  , 0.25, 0.75]])

# Same values as the dense version, shown only for comparison
X.toarray().sum(axis=0)/ len(corpus)
# array([0.25, 0.75, 0.5 , 0.75, 0.25, 0.5 , 1.  , 0.25, 0.75])
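Applied to the setup in the question, this could look roughly like the sketch below. The tiny docs and word_to_index values are hypothetical stand-ins for the asker's data, and the names list is rebuilt from vect.vocabulary_ so the snippet does not depend on which feature-name method the installed scikit-learn version provides. The last lines estimate the size of the dense array under the assumption that the counts use CountVectorizer's default int64 dtype, which is one plausible reason the conversion does not fit in 12 GB.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the asker's `docs` and `word_to_index`.
docs = ["the cat sat", "the dog sat down", "a cat and a dog"]
word_to_index = {"the": 0, "cat": 1, "sat": 2, "dog": 3,
                 "down": 4, "a": 5, "and": 6}

vect = CountVectorizer(vocabulary=list(word_to_index.keys()),
                       tokenizer=lambda x: x.split())
X = vect.fit_transform(docs)  # scipy sparse matrix, never densified

# X.sum(axis=0) returns a 1 x n_terms result; flatten it to a 1-D array.
rel_freq = np.asarray(X.sum(axis=0)).ravel() / X.shape[0]

# Column order follows the indices stored in vocabulary_.
names = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
freq_by_word = dict(zip(names, rel_freq))

# Rough size of the dense array from the question, assuming int64 counts.
dense_bytes = 40_000 * 120_000 * np.dtype(np.int64).itemsize
print(f"dense array would need about {dense_bytes / 1e9:.1f} GB")
```

The sparse sum only touches the stored non-zero counts, so its memory use scales with the number of non-zero entries rather than with documents times vocabulary size.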
