Python - Using TF-IDF to summarise dataframe text column

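I have a dataframe that contains a text column.

I want to create a new column that holds, for each row, a tuple/list of the top 'n' TF-IDF scoring words, as a way of summarising the text content.

A (very condensed) example dataframe is: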

df = pd.DataFrame({'Ref': [1,2,3,4,5], 'Text': ["the cow jumped off the other cow", 
                                                "the fox had a fox", 
                                                "the spanner was a tool to tool", 
                                                "the football player played football",
                                                "the house had a house"]})

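I have been searching for a solution for the past few days, but I can only find examples that extract the top TF-IDF words for the corpus as a whole, not for each row of the dataframe with scores computed against the whole corpus.

Can anyone point me in the right direction?

Here is one possible solution: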

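from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

n = 3  # number of top TF-IDF words to keep per row

# token_pattern=r"\w+" keeps single-character tokens such as "a",
# which the default pattern would drop, so no words are left out
tfidf = TfidfVectorizer(token_pattern=r"\w+")
X = tfidf.fit_transform(df['Text']).toarray()  # dense (n_rows, n_terms) array

# Column indices of the n highest-scoring terms in each row.
# argpartition does not sort them, so order within the top n is arbitrary.
ind = np.argpartition(-X, n, axis=1)[:, :n]

# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement and returns an ndarray
words = tfidf.get_feature_names_out()[ind]
scores = np.take_along_axis(X, ind, axis=1)

# One {word: score} dict per row
top_words = pd.Series([dict(zip(w, s)) for w, s in zip(words, scores)])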

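The result looks like this (the order of the words within each dict may vary, because argpartition does not sort the top n):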

>>> top_words
0    {'cow': 0.7111977362687212, 'other': 0.3555988681343606, 'off': 0.3555988681343606}
1    {'fox': 0.8665817814049075, 'had': 0.34957636239744133, 'a': 0.2901799593148741}
2    {'tool': 0.7218960199361867, 'was': 0.36094800996809334, 'spanner': 0.36094800996809334}
3    {'football': 0.8014723840888909, 'player': 0.40073619204444544, 'played': 0.40073619204444544}
4    {'house': 0.8665817814049075, 'had': 0.34957636239744133, 'a': 0.2901799593148741}