How to get a TF-IDF score for a whole sentence? I am able to get the TF-IDF score for each word

Question

I want to compute a TF-IDF score for each sentence. I am already able to compute the TF-IDF score of each word in a sentence.

How can I add a new column, "tf-idf score", that shows the TF-IDF score of each sentence in the DataFrame?

Message DataFrame-

# TF-IDF is a statistical measure of how relevant a word is to a document in a collection of documents.
# The higher the TF-IDF score, the more relevant the word.
# `cv` and `tf_idf_vector` are defined earlier (not shown here).
import pandas as pd

feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# Get the TF-IDF vector for the first document
first_document_vector = tf_idf_vector[0]

# Show the scores, highest first
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf_score"])
df.sort_values(by=["tfidf_score"], ascending=False)

Output-

Word        tfidf_score
lzglhlw     0.468806
nmbmp       0.333468
energysoar  0.320803
media       0.316627
lnboca      0.291699

df.head()

     message
0   aug post media php z m nmbmp lnboca d d z l lzglhlw d d http energysoar com mozilla compatible googlebot http www google com bot html
1   aug post al php z ae zbhf lnboca d d z lw d d http eventcollector com mozilla compatible googlebot http www google com bot html
2   aug post site tmp ctivrc php z ae zbhf lnboca d d z l npdguvdg wlw d d http eventcollector com mozilla compatible googlebot http www google com bot html
3   aug post goog es php z m nmbmp lnboca d d z lw d d http energysoar com mozilla compatible googlebot http www google com bot html
4   aug post robot php z ae zbhf lnboca d d z lw d d http eventcollector com mozilla compatible googlebot http www google com bot html

Answer 1

This link solves the problem above.

https://medium.com/analytics-vidhya/demonstrating-calculation-of-tf-idf-from-sklearn-4f9526e7e78b
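
For reference, here is a sketch of the smoothed TF-IDF formulation that scikit-learn uses by default and that the hand-rolled code in this answer mirrors. The symbols are notation introduced here: n is the number of messages, df(t) is the number of messages containing term t, tf(t, d) is the raw count of t in message d, and S(d) is the per-sentence score. Summing the term weights is one simple convention for a sentence-level score, not a standard definition.

```latex
\begin{aligned}
\mathrm{idf}(t) &= \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \\
\text{tf-idf}(t, d) &= \mathrm{tf}(t, d)\cdot\mathrm{idf}(t) \\
S(d) &= \sum_{t \in d} \text{tf-idf}(t, d)
\end{aligned}
```

Note that scikit-learn additionally L2-normalizes each document vector by default, so its per-word scores are smaller than the unnormalized values produced by these formulas.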

Example-

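import pandas as pd

# Remove all single-character tokens, then collapse runs of whitespace
# (regex=True is required: pandas 2.0+ treats the pattern as a literal string by default)
df['message'] = df['message'].str.replace(r'\b\w\b', '', regex=True).str.replace(r'\s+', ' ', regex=True)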

df['tokens'] = [x.lower().split() for x in df['message']] 

tf = df.tokens.apply(lambda x: pd.Series(x).value_counts()).fillna(0)   
tf.sort_index(inplace=True, axis=1)

tf.loc['Total Words in each columns']= tf.sum(numeric_only=True, axis=0)
tf.loc[:,'Number of Words in each message'] = tf.sum(numeric_only=True, axis=1)
#tf.to_excel("tf_syslog.xlsx")  #Exporting TF Score to excel file
tf.head()


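import numpy as np

# Use only the raw term counts: drop the summary row and column added above,
# otherwise they would be treated as an extra message and an extra token
counts = tf.drop(index='Total Words in each columns', columns='Number of Words in each message')

# Smoothed inverse document frequency: ln((1 + n) / (1 + df(t))) + 1
n_docs = df.shape[0]
idf = pd.Series(
    [np.log((n_docs + 1) / (sum(token in doc for doc in df.tokens) + 1)) + 1 for token in counts.columns],
    index=counts.columns,
)
pd.set_option("display.max_rows", None)
print(idf)

# Weight every term count by its IDF, then sum across terms for one score per message
tfidf = counts.mul(idf, axis=1)
tfidf["Total_TF-IDF Score"] = tfidf.sum(axis=1)
tfidf.head()

#tfidf.to_excel("syslog_message_tfidf_score.xlsx")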
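
If the goal is only the new "tf-idf score" column, a shorter route may be to let scikit-learn build the TF-IDF matrix and sum each row. The following is a minimal sketch, not the method from the linked article: the three sample messages are hypothetical stand-ins for the question's `message` column, and taking the row sum as the sentence score is an assumption about what "score for the whole sentence" should mean.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample data standing in for the question's `message` column
df = pd.DataFrame({"message": [
    "aug post media php nmbmp lnboca http energysoar com",
    "aug post al php zbhf lnboca http eventcollector com",
    "aug post robot php zbhf lnboca http eventcollector com",
]})

# The default token pattern already ignores single-character tokens
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df["message"])  # sparse, shape (n_messages, n_terms)

# One score per message: the sum of the TF-IDF weights of its terms.
# The sparse row sum is converted to a flat NumPy array before assignment.
df["tf-idf score"] = np.asarray(tfidf_matrix.sum(axis=1)).ravel()

print(df[["message", "tf-idf score"]])
```

With the default `norm="l2"`, each row is scaled to unit length before summing. Passing `norm=None` to `TfidfVectorizer` skips that scaling, which should bring the totals closer to the unnormalized `Total_TF-IDF Score` computed by hand above.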

python

tf-idf

dataframe

pandas