How to get the TF-IDF scores as well for the most important words?
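I'm working on a project using tf-idf. My dataframe has a column (df['liststring']) that contains preprocessed text (no punctuation, stop words, etc.) from various documents.
I ran the code below and got the 10 words with the highest tf-idf values for each document, but I would also like to see their scores.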
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['liststring']).toarray()
vocab = tfidf.vocabulary_
# Map column index -> term
reverse_vocab = {v:k for k,v in vocab.items()}
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
feature_names = tfidf.get_feature_names_out()
df_tfidf = pd.DataFrame(X_tfidf, columns = feature_names)
# Column indices sorted by ascending score; the last 10 are the top 10
idx = X_tfidf.argsort(axis=1)
tfidf_max10 = idx[:,-10:]
df_tfidf['top10'] = [[reverse_vocab.get(item) for item in row] for row in tfidf_max10 ]
df_tfidf['top10']
0 [kind, pose, world, preventive, sufficient, ke...
1 [mode, california, diseases, evidence, zoonoti...
2 [researcher, commentary, allegranzi, say, mora...
3 [carry, mild, man, whatever, suffering, downpl...
4 [region, service, almost, wednesday, detect, f...
...
754 [americans, plan, year, black, online, shop, s...
755 [relate, manor, tuesday, death, portobello, ce...
756 [one, october, eight, exist, transmit, cluster...
757 [wolfe, shelter, county, resident, cupertino, ...
758 [firework, year, blasio, day, marching, reimag...
Taking the first row as an example, instead of [kind, pose, world, preventive, sufficient, ke...], I would like the output to look like [kind:0.2, pose:0.3, world:0.4, preventive:0.5, sufficient:0.6, ke...]
# Pair each top-10 term with its score: row i of X_tfidf, column `item`
df_tfidf['top10'] = [[(reverse_vocab.get(item), X_tfidf[i, item]) for item in row]
                     for i, row in enumerate(tfidf_max10) ]
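The idea is that `argsort` returns column indices, and those same indices can be used to look up the scores. A minimal NumPy-only sketch of that lookup, using a made-up 2x4 score matrix and hypothetical term names (not the question's data):

```python
import numpy as np

# Made-up scores for illustration only: 2 documents x 4 terms.
scores = np.array([[0.1, 0.7, 0.0, 0.3],
                   [0.5, 0.0, 0.2, 0.9]])
terms = np.array(["alpha", "beta", "gamma", "delta"])  # hypothetical vocabulary

k = 2
# Column indices of the k largest scores per row, in ascending score order.
top_idx = np.argsort(scores, axis=1)[:, -k:]
# Look up the matching scores with the same indices.
top_scores = np.take_along_axis(scores, top_idx, axis=1)

# Zip terms and scores row by row.
pairs = [list(zip(terms[i].tolist(), s.tolist()))
         for i, s in zip(top_idx, top_scores)]
print(pairs)
```

Here `np.take_along_axis` does in one call what `X_tfidf[i, item]` does element by element in the list comprehension above.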
Test case:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.DataFrame(
    {'liststring': ['this is a cat', 'that is a dog', "a apple on the tree"]}
)
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df['liststring']).toarray()
vocab = tfidf.vocabulary_
reverse_vocab = {v:k for k,v in vocab.items()}
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
feature_names = tfidf.get_feature_names_out()
df_tfidf = pd.DataFrame(X_tfidf, columns = feature_names)
idx = X_tfidf.argsort(axis=1)
# Keep the two highest-scoring column indices per row
tfidf_max2 = idx[:,-2:]
print ([[(reverse_vocab.get(item), X_tfidf[i, item]) for item in row]
        for i, row in enumerate(tfidf_max2) ])
Output:
[[('cat', 0.6227660078332259), ('this', 0.6227660078332259)],
[('dog', 0.6227660078332259), ('that', 0.6227660078332259)],
[('the', 0.5), ('tree', 0.5)]]