MiniBatchSparsePCA on Text Data

Question

Goal

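I am trying to replicate the application described in this paper (Section 4.1), in which sparse principal component analysis is applied to a text corpus and the output is K principal components, each revealing a 'structure that is otherwise hidden'. In other words, each principal component should contain a list of words that all share a common theme.

I have used sklearn's MiniBatchSparsePCA to try to replicate the application, but my output is a matrix of zeros.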

Data
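My data comes from a survey that was cleaned in Stata. It is a vector of 368 answers (the count reported by text_data.shape below), each of which is a sentence.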

My attempt

# IMPORT LIBRARIES #
####################################
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn import decomposition
####################################

# USE PANDAS TO IMPORT STATA DATA. #
# Data comes from a survey, which was cleaned using Stata.

####################################
data_source = "/Users/****/q19_free_text.dta"
raw_data = pd.read_stata(data_source) #Reading in the data from a Stata file.  
text_data = raw_data.iloc[:,1] #Cleaning out Observation ID number.
text_data.shape     # Out[268]: (368, ) - There are 368 text (sentence) answers.
####################################

# Term Frequency - Inverse Document Frequency (TF-IDF)
####################################
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(text_data)

spca = decomposition.MiniBatchSparsePCA(n_components=2, alpha=0.5)
# spca.fit(X_train)  # Fitting on the sparse matrix raises:
# TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

X_train2 = X_train.toarray() #Trying with a dense array...
spca.fit(X_train2)

components = spca.components_


print(components)  #Out: [[ 0.  0.  0. ...,  0.  0.  0.]
                   #     [ 0.  0.  0. ...,  0.  0.  0.]]

components.shape   #Out: (2, 916)

# Empty output!
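
Because print abbreviates a 2 x 916 array, it is worth confirming that every loading is zero and not just the ones shown. The sketch below counts the non-zero loadings per component; `nonzero_counts` and `toy_components` are hypothetical names, and the toy array stands in for `spca.components_`.

```python
import numpy as np


def nonzero_counts(components, tol=0.0):
    """Count, per component (row), the loadings whose magnitude exceeds tol."""
    return (np.abs(np.asarray(components)) > tol).sum(axis=1)


# Hypothetical stand-in for spca.components_ (2 components, 4 features).
toy_components = np.array([[0.0, 0.0, 0.0, 0.0],
                           [0.0, 0.7, 0.0, -0.2]])

counts = nonzero_counts(toy_components)
print(counts)  # one count per component
```

scikit-learn documents `alpha` as the sparsity-controlling parameter, with higher values giving sparser components, so repeating this count after refitting with smaller `alpha` values is one way to check whether the penalty is what zeroes everything out. Whether that is the cause here is an assumption, not something the output above confirms.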

Additional notes

I used these resources to write the code above:

Official Example

Vectorising Text data

Previous question on the same problem

Answer 1

(...) to do something similar to that which is done in section 4.1 in the paper linked. There they 'summarize' a text corpus by using SPCA and the output is K components, where each component is a list of words (or, features).

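If I understand correctly, you are asking how to retrieve the words that belong to each component.

You can do this by retrieving the indices of the non-zero entries in each component (using appropriate numpy code on components). Then, using vectorizer.vocabulary_, you can look up which words/tokens those indices correspond to.

See this notebook for an example implementation (I used the 20 newsgroups dataset).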
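
A minimal sketch of that lookup follows; it is not the notebook's code. `words_per_component`, `toy_vocabulary`, and `toy_components` are hypothetical names, and the toy values stand in for `vectorizer.vocabulary_` (a dict mapping each word to its column index) and `spca.components_`.

```python
import numpy as np


def words_per_component(components, vocabulary):
    """Return, for each component, its non-zero words ordered by |loading|."""
    # vocabulary maps word -> column index; invert it to index -> word.
    index_to_word = {idx: word for word, idx in vocabulary.items()}
    result = []
    for row in np.asarray(components):
        nonzero = np.flatnonzero(row)  # column indices with non-zero loadings
        # Put the strongest loadings (by absolute value) first.
        order = np.argsort(-np.abs(row[nonzero]), kind="stable")
        result.append([(index_to_word[int(i)], float(row[i]))
                       for i in nonzero[order]])
    return result


# Hypothetical stand-ins for vectorizer.vocabulary_ and spca.components_.
toy_vocabulary = {"price": 0, "service": 1, "staff": 2, "cost": 3}
toy_components = np.array([[0.8, 0.0, 0.0, -0.6],
                           [0.0, 0.3, 0.9, 0.0]])

for k, words in enumerate(words_per_component(toy_components, toy_vocabulary)):
    print(k, words)
```

With a fitted model, the call would be `words_per_component(spca.components_, vectorizer.vocabulary_)`. A component whose loadings are all zero yields an empty list, which is what the output in the question would produce.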


machine-learning

text-mining

pca

scikit-learn

sklearn-pandas