Trying to make use of a library to conduct some topic modeling, but it's not going well

Question

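I have a .csv term-document matrix, and I'd like to perform latent Dirichlet allocation on it using gensim in Python. However, I'm not particularly familiar with either Python or LDA.

I posted on the gensim... forum? I don't know if that's what it's called. The person who wrote the package responded: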

how big is your term-document CSV matrix?

If it's small enough = fits in RAM, you could:

1) use numpy.loadtxt() to load your CSV into an in-memory matrix

2) convert the matrix to a corpus with gensim.matutils.Dense2Corpus() . Check out its documents_columns flag, it lets you switch between document-term and term-document transposition easily.

3) use that corpus to train your LDA model.

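So this leads me to believe that the answer to this question is incorrect.

It looks like a dictionary is a required input to the LDA model; is that not right? Here is what I think successfully turns the .csv into a corpus: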

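import numpy as np
import gensim

# Load the csv as an integer matrix: skip the header row, fill blank cells with 0
file = np.genfromtxt(fname=fPathName, dtype="int", delimiter=",", skip_header=1, missing_values="", filling_values=0)

# documents_columns=False: each row (not each column) is treated as a document
corpus = gensim.matutils.Dense2Corpus(file, documents_columns=False)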

Any help would be appreciated.

Edit: It turns out that Gensim dictionaries and Python dictionaries are not exactly the same thing.

Answer 1

So, I took this snippet from the Gensim documentation:

from gensim.models import LdaModel
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=10)

The file you want to analyse is a csv, so you can open it with pandas:

import pandas as pd
df = pd.read_csv(filename) # add header=None if the file has no column names

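Once the file is imported, everything is loaded into a data frame, and you need to gather all the texts into a single list (see the first comment in the gensim snippet), which should look like this: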

["text one..", "text 2..", "text 3..."]

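You can do this by iterating over the data frame and appending each text to an empty list. Before that, you also need to check which column of the csv file contains the text you want to analyse.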

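common_texts = []  # initialise empty list
# name_column_with_text: name of the csv column that holds the text
for ind, row in df.iterrows():  # iterrows() yields (index, row) pairs
    text = row[name_column_with_text]
    common_texts.append(text)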
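
As an alternative to the explicit loop, the column can be pulled out in one step. This is only a sketch: the data frame and the column name `"text"` are placeholders standing in for whatever the csv actually contains.

```python
import pandas as pd

# Hypothetical data frame standing in for the one returned by pd.read_csv.
df = pd.DataFrame({"id": [1, 2, 3], "text": ["text one", "text 2", "text 3"]})
name_column_with_text = "text"  # assumed column name

# Take the whole column at once; dropna() skips rows with missing text
# and astype(str) makes sure every entry is a string.
common_texts = df[name_column_with_text].dropna().astype(str).tolist()
print(common_texts)
```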

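Once you have the list of texts, you can apply the code from the gensim documentation. Note that Dictionary and doc2bow expect each document as a list of tokens rather than a single string, so split each text into words first. You may also run into memory problems, depending on the size of your csv file.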
