MemoryError in pd.DataFrame()

Question

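I have a df with 2 columns and 5 million rows, all text (customer reviews of businesses). df.head() produces:

df.info() shows a memory usage of only 120.3+ MB.

I am trying to do topic modeling on df['text'] using the gensim library. I first create a document-term matrix (dtm) and then run Latent Dirichlet Allocation (LDA), as follows: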

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from gensim import matutils, models
import scipy.sparse

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(df.text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names()) #LINE THROWING MemoryError

data_dtm.index = df.index

tdm = data_dtm.transpose()

sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

id2word = dict((v, k) for k, v in cv.vocabulary_.items())
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

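Problem: line 7 (pd.DataFrame()) throws a MemoryError even though 60% of my machine's memory is still free. I get the same MemoryError even when I repeat the operation on only the first 100,000 rows of df.

Since this is topic modeling, I would prefer to analyze all rows together, or at least in a few batches.

Question: What causes Python to run out of memory when converting data_cv to a DataFrame? How can I overcome it?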

Answer 1

Most likely, data_cv.toarray() is the step most responsible for the memory expansion, because it converts an efficient sparse representation into a full, dense array.
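To get a feel for the scale, here is a rough back-of-the-envelope estimate. The vocabulary size and the average number of distinct terms per review are assumptions chosen for illustration (the question does not state them); the 8-byte cell size matches CountVectorizer's default int64 counts, and the CSR figure is an approximation that assumes 4-byte indices.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array of fixed-size cells."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=8, index_size=4):
    """Approximate bytes for a CSR matrix: values, column indices, row pointers."""
    return nnz * (itemsize + index_size) + (n_rows + 1) * index_size


n_docs = 100_000          # the smaller sample mentioned in the question
vocab_size = 50_000       # ASSUMPTION: hypothetical number of distinct terms
avg_distinct_terms = 20   # ASSUMPTION: hypothetical distinct terms per review

dense_gb = dense_bytes(n_docs, vocab_size) / 1e9
sparse_mb = csr_bytes(n_docs, n_docs * avg_distinct_terms) / 1e6

print(f"dense array:  ~{dense_gb:.1f} GB")
print(f"sparse (CSR): ~{sparse_mb:.1f} MB")
```

Under these assumed numbers, the dense array for just 100,000 rows would need tens of gigabytes, while the sparse matrix stays in the tens of megabytes, which would explain why even the reduced sample fails.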

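To check, try performing that step on its own preceding line, assigning the result to a temporary variable.

However, if your end goal is a Gensim LDA analysis, those steps work fine with tokenized text as input, so the other operations (involving CountVectorizer and storing temporary results or arrays in a giant term-document Pandas data structure) may be superfluous.

For example (ignoring any stop-word filtering):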

from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

tokenized_texts = [text_string.split() for text_string in df.text]
texts_dictionary = Dictionary(tokenized_texts)
texts_bows = [texts_dictionary.doc2bow(text_tokens) for text_tokens in tokenized_texts]

lda = LdaModel(corpus=texts_bows, id2word=texts_dictionary, num_topics=2)

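This still creates two giant in-memory copies of the df.text column, a tokenized_texts list and a texts_bows list, so it uses more memory than is optimal. (The best practice for very large corpora is to leave them on disk and stream them item by item into the processing steps.)

However, it will likely use less memory than the giant dense .toarray() step, and it never creates the not-strictly-necessary CountVectorizer or the temporary array and DataFrame objects.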
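As a minimal sketch of the streaming idea, assuming the reviews have been written to a plain-text file with one review per line: the class name StreamedTexts and the demo file are hypothetical and not part of Gensim. The key point is that the object can be iterated more than once, reading from disk each time, instead of holding every tokenized document in a list.

```python
import os
import tempfile


class StreamedTexts:
    """Re-iterable stream of tokenized documents, one document per file line.

    Hypothetical helper for illustration: each pass re-reads the file,
    so only one line's tokens are held in memory at a time.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Small demo file standing in for the on-disk corpus (made-up content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("great food friendly staff\n\nslow service cold food\n")
    demo_path = tmp.name

stream = StreamedTexts(demo_path)
first_pass = list(stream)   # materialized here only to inspect the demo
second_pass = list(stream)  # a second pass works because __iter__ reopens the file
os.remove(demo_path)

print(first_pass)
```

An object like this could be passed to Dictionary(...) in place of tokenized_texts, and a similar wrapper that yields doc2bow(...) results per line could stand in for texts_bows, since LdaModel with passes greater than 1 needs a corpus it can iterate repeatedly.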


pandas

lda

gensim

topic-modeling