How to transform TfidfVectorizer() outputs in dataframes
I found an answer about the model and its specific outputs (). Great. I would like to know how to turn the printed results into a dataframe:
'''
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
corpus = [
'I would like to check this document',
'How about one more document',
'Aim is to capture the key words from the corpus'
]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names_out()
top_n = 3
print('tf_idf scores: \n', sorted(list(zip(vectorizer.get_feature_names_out(),
                                           X.sum(0).getA1())),
                                  key=lambda x: x[1], reverse=True)[:top_n])
# tf_idf scores :
# [('document', 1.4736296010332683), ('check', 0.6227660078332259), ('like', 0.6227660078332259)]
print('idf values: \n', sorted(list(zip(feature_array, vectorizer.idf_)),
                               key=lambda x: x[1], reverse=True)[:top_n])
# idf values:
# [('aim', 1.6931471805599454), ('capture', 1.6931471805599454), ('check', 1.6931471805599454)]
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
feature_array = vectorizer.get_feature_names_out()
print('Frequency: \n', sorted(list(zip(vectorizer.get_feature_names_out(),
                                       X.sum(0).getA1())),
                              key=lambda x: x[1], reverse=True)[:top_n])
'''
Thanks in advance!
The code below gives you a DataFrame containing the tf_idf, idf, and frequency of each word, sorted by tf_idf in descending order.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus'
]

# Fit the TF-IDF weights on the corpus.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

# Fit the raw term counts with a separate vectorizer.
count_vectorizer = CountVectorizer(stop_words='english')
count_X = count_vectorizer.fit_transform(corpus)
count_feature_array = count_vectorizer.get_feature_names_out()
count_totals = count_X.sum(0).getA1()

# Look up each TF-IDF word's total count by name, so the values line up
# even if the two vectorizers order their features differently.
frequencies = [count_totals[np.where(count_feature_array == w)[0][0]]
               for w in vectorizer.get_feature_names_out()]

df = pd.DataFrame({'word': vectorizer.get_feature_names_out(),
                   'tf_idf': X.sum(0).getA1(),
                   'idf': vectorizer.idf_,
                   'freqs': frequencies}).set_index('word').sort_values('tf_idf', ascending=False)
print(df)
# Prints:
            tf_idf       idf  freqs
word
document  1.473630  1.287682      2
check     0.622766  1.693147      1
like      0.622766  1.693147      1
aim       0.447214  1.693147      1
capture   0.447214  1.693147      1
corpus    0.447214  1.693147      1
key       0.447214  1.693147      1
words     0.447214  1.693147      1
If you only want the top n words by tf_idf, you can do this:
top_n = 3
print(df[:top_n])
# Prints:
            tf_idf       idf  freqs
word
document  1.473630  1.287682      2
check     0.622766  1.693147      1
like      0.622766  1.693147      1
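Slicing with `df[:top_n]` relies on the DataFrame already being sorted by tf_idf. If you would rather not depend on a prior sort, or want the top words by a different column, `DataFrame.nlargest` is one option. The sketch below is self-contained, so it rebuilds a small `df` by hand from the first four rows of the printed table above; in practice you would call `nlargest` on the real `df`.

```python
import pandas as pd

# Values copied from the printed table; a stand-in for the real DataFrame.
df = pd.DataFrame(
    {'tf_idf': [1.473630, 0.622766, 0.622766, 0.447214],
     'idf': [1.287682, 1.693147, 1.693147, 1.693147],
     'freqs': [2, 1, 1, 1]},
    index=pd.Index(['document', 'check', 'like', 'aim'], name='word'),
)

top_n = 3

# Rows with the largest tf_idf values, regardless of the current row order.
top_by_tfidf = df.nlargest(top_n, 'tf_idf')

# The same call works for any numeric column, e.g. the rarest terms by idf.
top_by_idf = df.nlargest(top_n, 'idf')

print(top_by_tfidf)
print(top_by_idf)
```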