Word frequency with TfidfVectorizer
I'm trying to compute word frequencies for a dataframe of messages using TF-IDF. This is what I have so far:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

new_group['tokenized_sents'] = new_group.apply(lambda row: nltk.word_tokenize(row['message']), axis=1).astype(str).lower()
vectoriser = TfidfVectorizer()
new_group['tokenized_vector'] = list(vectoriser.fit_transform(new_group['tokenized_sents']).toarray())
But with the code above I get a bunch of zeros instead of word frequencies. How can I fix this to get the correct word frequencies for each message? This is my dataframe:
user_id date message tokenized_sents tokenized_vector
X35WQ0U8S 2019-02-17 Need help ['need','help'] [0.0,0.0]
X36WDMT2J 2019-03-22 Thank you! ['thank','you','!'] [0.0,0.0,0.0]
First, for counts you don't want TfidfVectorizer, because its output is normalized. You want CountVectorizer. Second, you don't need to tokenize the words yourself: sklearn has a built-in tokenizer in both TfidfVectorizer and CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# add whatever settings you want
countVec = CountVectorizer()

# fit-transform directly on the raw (lowercased) messages
cv = countVec.fit_transform(df['message'].str.lower())

# feature names (on scikit-learn < 1.0 use get_feature_names() instead)
cv_feature_names = countVec.get_feature_names_out()

# total count of each feature across all messages
feature_count = cv.toarray().sum(axis=0)

# map each feature name to its count
dict(zip(cv_feature_names, feature_count))