Which 10 words have the highest TF-IDF value in each document / overall?
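I am trying to get the 10 words with the highest TF-IDF score for each document.
My dataframe has a column containing preprocessed text from my various documents (no punctuation, stop words, etc.). In this example, one row represents one document.
It has more than 500 rows, and I am curious about the most important words in each row.
So I ran the following code: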
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(df['liststring'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
df2 = pd.DataFrame(denselist, columns=feature_names)
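This gives me a TF-IDF matrix.
My question is: how can I collect the 10 words with the highest TF-IDF values? Ideally I would create a column in my original dataframe (df) that contains the top 10 words for each row, but I would also like to know which words are the most important overall.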
A minimal reproducible example using the 20newsgroups dataset:
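import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
X, y = fetch_20newsgroups(return_X_y=True)
tfidf = TfidfVectorizer()
# toarray() densifies the full documents x vocabulary matrix, which needs a lot of memory
X_tfidf = tfidf.fit_transform(X).toarray()
# vocabulary_ maps term -> column index; invert it to look terms up by index
vocab = tfidf.vocabulary_
reverse_vocab = {v: k for k, v in vocab.items()}
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out()
df_tfidf = pd.DataFrame(X_tfidf, columns=feature_names)
# argsort sorts ascending, so the last 10 indices of each row are its 10 highest scores
idx = X_tfidf.argsort(axis=1)
tfidf_max10 = idx[:, -10:]
df_tfidf['top10'] = [[reverse_vocab.get(item) for item in row] for row in tfidf_max10]
df_tfidf['top10']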
0 [this, was, funky, rac3, bricklin, tellme, umd...
1 [1qvfo9innc3s, upgrade, experiences, carson, k...
2 [heard, anybody, 160, display, willis, powerbo...
3 [joe, green, csd, iastate, jgreen, amber, p900...
4 [tom, n3p, c5owcb, expected, std, launch, jona...
...
11309 [millie, diagnosis, headache, factory, scan, j...
11310 [plus, jiggling, screen, bodin, blank, mac, wi...
11311 [weight, ended, vertical, socket, the, westes,...
11312 [central, steven, steve, collins, bolson, hcrl...
11313 [california, kjg, 2101240, willow, jh2sc281xpm...
Name: top10, Length: 11314, dtype: object
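Two details of the `argsort` approach above are easy to miss: the 10 terms come out ordered from lowest to highest score, and a document with fewer than 10 distinct terms gets padded with zero-score words. The following is only a minimal sketch of one way to handle both, using NumPy alone; `top_k_terms` and the toy vocabulary are hypothetical names made up for illustration, not part of scikit-learn.

```python
import numpy as np

def top_k_terms(scores, feature_names, k=10):
    """Return up to k (term, score) pairs for one document, highest score first."""
    scores = np.asarray(scores, dtype=float)
    names = np.asarray(feature_names)
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    # drop terms with a zero score, i.e. terms that do not occur in the document
    order = order[scores[order] > 0]
    return list(zip(names[order].tolist(), scores[order].tolist()))

# Toy row with a hypothetical 4-word vocabulary
names = ["apple", "banana", "cherry", "date"]
row = [0.0, 0.7, 0.2, 0.0]
print(top_k_terms(row, names, k=3))  # [('banana', 0.7), ('cherry', 0.2)]
```

Applied to the example above, this would be called once per row of `X_tfidf` with `feature_names` as the second argument.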
To get the 10 features with the highest TF-IDF overall, use:
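import numpy as np
# Take each term's maximum score over all documents, then the 10 largest (in ascending order)
global_top10_idx = X_tfidf.max(axis=0).argsort()[-10:]
np.asarray(feature_names)[global_top10_idx]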
Feel free to ask if anything is unclear.