How to transform sklearn tfidf vector pandas output to a meaningful format

Question

I have used sklearn to compute tfidf scores for my corpus, but the output is not in the format I want.

Code:

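import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Score unigrams, bigrams and trigrams across the corpus
vect = TfidfVectorizer(ngram_range=(1,3))
tfidf_matrix = vect.fit_transform(df_doc_wholetext['csv_text'])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(tfidf_matrix.toarray(),columns=vect.get_feature_names_out())

df['filename'] = df.index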

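I have:

word1, word2 and word3 can be any words in the corpus; I call them word1, word2 and word3 only as an example.

What I need:

I tried transforming it, but that converts all of the columns into rows. Is there a way to achieve this?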

Answer 1

df1 = df.filter(like='word').stack().reset_index()
df1.columns = ['filename','word_name','score']

Output:

   filename word_name  score
0         0     word1   0.01
1         0     word2   0.04
2         0     word3   0.05
3         1     word1   0.02
4         1     word2   0.99
5         1     word3   0.07
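
For readers who want to run this step in isolation, here is a minimal sketch. It assumes a toy wide-format frame whose scores are simply the illustrative values from the output above, not real tf-idf results. `filter(like='word')` keeps only the columns whose name contains "word", and `stack()` moves those column labels into a second index level, which `reset_index()` then turns into ordinary columns.

```python
import pandas as pd

# Toy wide-format frame: one row per document, one column per word.
# The numbers are the illustrative scores from the answer, not real tf-idf output.
df = pd.DataFrame(
    {"word1": [0.01, 0.02], "word2": [0.04, 0.99], "word3": [0.05, 0.07]}
)
df["filename"] = df.index

# Keep the word columns, then pivot the column labels into rows.
df1 = df.filter(like="word").stack().reset_index()
df1.columns = ["filename", "word_name", "score"]

print(df1)
```

Because `filename` was set from the row index, the first column produced by `reset_index()` carries the same values as `filename`.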

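Update for general column headers (columns that do not share a prefix such as 'word'): drop the filename column and stack everything else. The row index becomes the filename column after reset_index().

df1 = df.drop(columns='filename').stack().reset_index()
df1.columns = ['filename','word_name','score']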
