Trying to replicate TFIDF example, multiplication returns wrong number
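I am trying to replicate the TFIDF example from this video: Using TF-IDF to convert unstructured text to useful features
As far as I can tell, my code is the same as in the example, except that I use .items (Python 3) instead of .iteritems (Python 2):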
docA = "the cat sat on my face"
docB = "the dog sat on my bed"
bowA = docA.split(" ")
bowB = docB.split(" ")
wordSet = set(bowA).union(set(bowB))
wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)
for word in bowA:
    wordDictA[word] += 1
for word in bowB:
    wordDictB[word] += 1
import pandas as pd
bag = pd.DataFrame([wordDictA, wordDictB])
print(bag)
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict
tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)
def computeIDF(docList):
    import math
    idfDict = {}
    N = len(docList)
    # Count N of docs that contain word w
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict
idfs = computeIDF([wordDictA, wordDictB])
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf
tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)
TF = pd.DataFrame([tfidfBowA, tfidfBowB])
print(TF)
The resulting table should look like this, with the common words (on, my, sat, the) all scoring 0:
        bed       cat       dog      face        my        on       sat       the
0  0.000000  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000
1  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000  0.000000
But the data frame I get looks like this, with the same score for every word except those that do not appear in the document (bed/dog for the first, cat/face for the second):
        bed       cat       dog      face        my        on       sat       the
0  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833  0.020833
1  0.020833  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833
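If I print(idfs) I get
{'my': 0.0, 'sat': 0.0, 'dog': 0.6931, 'cat': 0.6931, 'on': 0.0, 'the': 0.0, 'face': 0.6931, 'bed': 0.6931}
Here, the words that appear in both documents have a value of 0, which is then meant to weight their importance down because they are common to all documents. Before using the computeTFIDF function, the data looks like this:
{'my': 0.1666, 'sat': 0.1666, 'dog': 0.0, 'cat': 0.1666, 'on': 0.1666, 'the': 0.1666, 'face': 0.1666, 'bed': 0.0}
Since the function multiplies the two numbers, "my" (idf of 0) should be 0, and "dog" (idf of 0.6931) should be 0.6931 * 0.1666 = 0.11, as in the example. Instead, I get 0.02083 for every word except those that are absent from the document. Is there something other than the .items/.iteritems syntax difference between Python 2 and 3 that is messing up my code?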
In the second-to-last part, just before converting to a DataFrame, change these two lines:
tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)
to:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)
To compute the TF-IDF you have to call the function computeTFIDF() instead of computeTF().
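The 0.020833 in the question's output appears to come from that mistaken call: computeTF divides each value by the len() of its second argument, and here that argument is idfs, a dict with 8 keys, so each term frequency of 1/6 becomes (1/6)/8. The sketch below only checks that arithmetic; the numbers are hard-coded from the question rather than recomputed, and the variable names are illustrative.

```python
import math

tf = 1 / 6          # each word of docA occurs once in a 6-word document
vocab_size = 8      # len(idfs): distinct words across both documents

# What computeTF(tfBowA, idfs) effectively computes: tf / len(idfs)
wrong = tf / float(vocab_size)
print(round(wrong, 6))   # 0.020833

# What computeTFIDF(tfBowA, idfs) computes for a word found in 1 of 2 docs
idf = math.log(2 / 1.0)
right = tf * idf
print(round(right, 6))   # 0.115525
```

This also explains why the words missing from a document still show 0: their term frequency is 0, and 0 divided by 8 is still 0.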
Output
tfidfBowA
{'bed': 0.0,
 'cat': 0.11552453009332421,
 'dog': 0.0,
 'face': 0.11552453009332421,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}
tfidfBowB
{'bed': 0.11552453009332421,
 'cat': 0.0,
 'dog': 0.11552453009332421,
 'face': 0.0,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}
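For cross-checking these numbers without pandas, here is a compact end-to-end sketch of the same calculation. It is not the code from the video: the helper names tf and idf and the use of collections.Counter are my own choices, and it assumes the same definitions as the question (term count divided by document length, natural log of N over document frequency).

```python
import math
from collections import Counter

docs = ["the cat sat on my face", "the dog sat on my bed"]
bows = [d.split(" ") for d in docs]
vocab = sorted(set().union(*bows))   # every distinct word, in a stable order

def tf(bow):
    # Term frequency: occurrences of each vocab word / words in the document
    counts = Counter(bow)
    return {w: counts[w] / len(bow) for w in vocab}

def idf(bows):
    # Inverse document frequency: log(N / number of docs containing the word)
    n = len(bows)
    df = {w: sum(1 for bow in bows if w in bow) for w in vocab}
    return {w: math.log(n / df[w]) for w in vocab}

idfs = idf(bows)
# TF-IDF is the product of the two, word by word
tfidf = [{w: v * idfs[w] for w, v in tf(bow).items()} for bow in bows]
for row in tfidf:
    print({w: round(v, 6) for w, v in row.items()})
```

Words shared by both documents get an idf of log(2/2) = 0 and therefore a score of 0, while words unique to one document score (1/6) * log(2).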
Hope this helps!