Python: Using a list with TF-IDF
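I have the following code, which currently compares all the words in 'Tokens' against each document in 'df'. Is there a way to compare a predefined list of words against the documents instead of 'Tokens'?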
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(norm=None)
# join each row's token list back into one space-separated string
list_contents = []
for index, row in df.iterrows():
    list_contents.append(' '.join(row.Tokens))
# list_contents = df.Content.values
tfidf_matrix = tfidf_vectorizer.fit_transform(list_contents)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
df_tfidf.head(10)
Any help is appreciated. Thanks!
Not sure if I understand you correctly, but if you want the vectorizer to consider only a fixed list of words, you can use the vocabulary parameter.
my_words = ["foo", "bar", "baz"]
# set the vocabulary parameter with your list of words
tfidf_vectorizer = TfidfVectorizer(
    norm=None,
    vocabulary=my_words)
list_contents = []
for index, row in df.iterrows():
    list_contents.append(' '.join(row.Tokens))
# this matrix will have only 3 columns because we have forced
# the vectorizer to use just the words foo, bar and baz,
# so it'll ignore all other words in the documents.
tfidf_matrix = tfidf_vectorizer.fit_transform(list_contents)