TfidfVectorizer results in adding null rows and incorrect score assignment

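Question: Why does sklearn's TfidfVectorizer return scores attached to values that don't exist (i.e., the vectorizer seems to create null rows)? Also, why don't the scores line up with the appropriate features?

Pipeline: Pull text data from a SQL database, split the text into bigrams, compute each bigram's frequency and tf-idf per document, and load the results back into the SQL database.

Current state:

Two columns of data (number, text) are pulled in. The text is cleaned to produce a third column, cleanText: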

   number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna

Remove rows whose cleanText contains only one word:

data = data[data['cleanText'].str.contains(' ')]
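
One edge case with this filter: if cleanText contains missing values, the mask returned by str.contains can itself contain NaN, which boolean indexing rejects. The following is a minimal sketch of a more defensive version. The sample frame is hypothetical (the last two rows are not part of the original data), and na=False is one possible choice for how to treat missing text.

```python
import pandas as pd

# Hypothetical sample: the last two rows are NOT in the original data.
data = pd.DataFrame({
    "number": [123, 234, 456, 567],
    "cleanText": ["farmer plants grain", "farmer son go fishing", "tuna", None],
})

# na=False treats missing text as "no space found", so the mask stays boolean.
mask = data["cleanText"].str.contains(" ", na=False)
filtered = data[mask]

print(filtered["number"].tolist())
```

Only the rows with at least two space-separated words survive, so every remaining document can yield at least one bigram.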

Group by number, then run feature extraction:

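import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data_grouped = data.groupby('number')

# Bigram counts per document
word_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
# tf-idf with default settings (this is the line in question)
tfidf_vectorizer = TfidfVectorizer()

nGrams = pd.DataFrame()

for id, group in data_grouped:
    X = word_vectorizer.fit_transform(group['cleanText'])
    Y = tfidf_vectorizer.fit_transform(group['cleanText'])
    frequencies = sum(X).toarray()[0]
    Y.todense()  # result is discarded, so this line has no effect
    tfidfscore = Y.toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    results2 = pd.DataFrame(tfidfscore, columns=['tfidfscore'])
    # get_feature_names() was removed in scikit-learn 1.2 (use get_feature_names_out())
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names(), columns=['nGram'])
    dfinner['id'] = id
    results = results.join(dfinner)
    results = results2.join(results)
    # DataFrame.append was removed in pandas 2.0 (use pd.concat)
    nGrams = nGrams.append(results)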


print(nGrams)

Output:

   tfidfscore  frequency           nGram     id
0     0.57735        1.0   farmer plants  123.0
1     0.57735        1.0    plants grain  123.0
2     0.57735        NaN             NaN    NaN
0     0.50000        1.0      farmer son  234.0
1     0.50000        1.0      go fishing  234.0
2     0.50000        1.0          son go  234.0
3     0.50000        NaN             NaN    NaN
0     0.57735        1.0    catches tuna  345.0
1     0.57735        1.0  fisher catches  345.0
2     0.57735        NaN             NaN    NaN

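Problems:

  1. The output contains extra rows in which every column except tfidfscore is null.
  2. The tfidfscore values seem mismatched. It seems the 0.5 score should be associated with number (id) 123 and number 345, since each of those rows contains two bigrams (i.e., 0.5, or 50% importance, per bigram).

Why is TfidfVectorizer adding these rows and assigning the scores to the wrong numbers? Is this related to indexing? Any and all insight would be greatly appreciated. Thanks!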

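Answer: This was a simple issue that I overlooked. The TfidfVectorizer was never initialized with the parameters needed to make it work as intended, so it was building its default unigram vocabulary while the CountVectorizer was building bigrams. I just changed this line: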

tfidf_vectorizer = TfidfVectorizer()

to this:

tfidf_vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')