TfidfVectorizer adds null rows and assigns scores incorrectly
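Question: Why does sklearn's TfidfVectorizer return scores attached to values that do not exist (i.e. the vectorizer creates null rows)? Also, why do the scores not match the appropriate attributes?
Pipeline: Pull text data from a SQL database, split the text into bigrams, compute each bigram's frequency per document and its tf-idf per document, then load the results back into the SQL database.
Current state:
Two columns of data are pulled in (number, text). The text is cleaned to produce a third column, cleanText: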
   number  text                               cleanText
0  123     The farmer plants grain            farmer plants grain
1  234     The farmer and his son go fishing  farmer son go fishing
2  345     The fisher catches tuna            fisher catches tuna
Drop the rows that contain only one word:
data = data[data['cleanText'].str.contains(' ')]
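To illustrate what this filter keeps, here is a minimal sketch on a made-up frame. The first three rows mirror the sample data; the single-word row with number 456 is hypothetical and only there to show a row being dropped. A row survives when its cleanText contains at least one space, meaning at least two words, which is the minimum needed to form a bigram.

```python
import pandas as pd

# Hypothetical sample: rows 123-345 mirror the question's data,
# row 456 is invented to represent a single-word document.
data = pd.DataFrame({
    "number": [123, 234, 345, 456],
    "cleanText": [
        "farmer plants grain",
        "farmer son go fishing",
        "fisher catches tuna",
        "tuna",
    ],
})

# Keep only rows whose cleaned text contains a space (two or more words).
filtered = data[data["cleanText"].str.contains(" ")]
print(filtered["number"].tolist())
```

Only the 456 row is removed, since "tuna" on its own contains no space.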
Group by number, then extract the features:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data_grouped = data.groupby('number')
# Counts bigrams (two consecutive word tokens)
word_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
tfidf_vectorizer = TfidfVectorizer()
nGrams = pd.DataFrame()
for id, group in data_grouped:
    X = word_vectorizer.fit_transform(group['cleanText'])
    Y = tfidf_vectorizer.fit_transform(group['cleanText'])
    # Total count of each bigram across the group's documents
    frequencies = sum(X).toarray()[0]
    Y.todense()
    # tf-idf scores of the first document in the group
    tfidfscore = Y.toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    results2 = pd.DataFrame(tfidfscore, columns=['tfidfscore'])
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names(), columns=['nGram'])
    dfinner['id'] = id
    # Both joins align rows by positional index
    results = results.join(dfinner)
    results = results2.join(results)
    nGrams = nGrams.append(results)
print(nGrams)
Output:
   tfidfscore  frequency  nGram           id
0  0.57735     1.0        farmer plants   123.0
1  0.57735     1.0        plants grain    123.0
2  0.57735     NaN        NaN             NaN
0  0.50000     1.0        farmer son      234.0
1  0.50000     1.0        go fishing      234.0
2  0.50000     1.0        son go          234.0
3  0.50000     NaN        NaN             NaN
0  0.57735     1.0        catches tuna    345.0
1  0.57735     1.0        fisher catches  345.0
2  0.57735     NaN        NaN             NaN
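Problems:
- The output includes new rows in which every column except tfidfscore is null.
- The tfidfscore values seem mismatched. It seems the 0.5 score should be associated with number (id) 123 and number 345, since each of those rows contains two bigrams (i.e. 0.5, or 50% importance, for each).
Why is TfidfVectorizer adding these rows and assigning scores to the wrong numbers? Does this have to do with the index? Any and all insight would be greatly appreciated. Thanks!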
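This was a simple problem that I had overlooked. The TfidfVectorizer was never initialized with the parameters it needed to work as intended: with its defaults it extracts single words rather than bigrams, so it produced a different number of features than word_vectorizer. So I just changed this line:
tfidf_vectorizer = TfidfVectorizer()
to this: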
tfidf_vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')