Python - Using TF-IDF to summarise dataframe text column
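I have a dataframe with a text column.
I want to create a new column that holds, for each row, a tuple/list of the top 'n' TF-IDF scoring words, as a way of summarising the text content.
A (very simplified) example dataframe is: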
df = pd.DataFrame({'Ref': [1,2,3,4,5], 'Text': ["the cow jumped off the other cow",
"the fox had a fox",
"the spanner was a tool to tool",
"the football player played football",
"the house had a house"]})
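I have been looking for a solution for the past few days, but I can only find examples that return the top TF-IDF words for the corpus as a whole, not for each row of the dataframe scored against the whole corpus.
Can anyone point me in the right direction?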
Here is a possible solution:
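from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

n = 3  # top n TF-IDF words
tfidf = TfidfVectorizer(token_pattern=r"\w+")  # no words are left out
X = tfidf.fit_transform(df['Text'])
dense = X.toarray()  # plain ndarray; acceptable for a small corpus
# argpartition moves the n largest scores of each row to the first n columns,
# in no guaranteed order; ties may be broken differently between versions
ind = np.argpartition(-dense, n, axis=1)[:, :n]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = tfidf.get_feature_names_out()[ind]
scores = np.take_along_axis(dense, ind, axis=1).tolist()
top_words = pd.Series([dict(zip(w, s)) for w, s in zip(words, scores)])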
The result is as follows:
>>> top_words
0 {'cow': 0.7111977362687212, 'other': 0.3555988681343606, 'off': 0.3555988681343606}
1 {'fox': 0.8665817814049075, 'had': 0.34957636239744133, 'a': 0.2901799593148741}
2 {'tool': 0.7218960199361867, 'was': 0.36094800996809334, 'spanner': 0.36094800996809334}
3 {'football': 0.8014723840888909, 'player': 0.40073619204444544, 'played': 0.40073619204444544}
4 {'house': 0.8665817814049075, 'had': 0.34957636239744133, 'a': 0.2901799593148741}