Trying to replicate TFIDF example, multiplication returns wrong number

I am trying to replicate the TF-IDF example from this video: Using TF-IDF to convert unstructured text to useful features

As far as I can tell, my code is the same as in the example, except that I use .items (Python 3) instead of .iteritems (Python 2):

docA = "the cat sat on my face"
docB = "the dog sat on my bed"

bowA = docA.split(" ")
bowB = docB.split(" ")

wordSet= set(bowA).union(set(bowB))

wordDictA = dict.fromkeys(wordSet, 0)
wordDictB = dict.fromkeys(wordSet, 0)

for word in bowA:
        wordDictA[word]+=1

for word in bowB:
        wordDictB[word]+=1

import pandas as pd

bag = pd.DataFrame([wordDictA, wordDictB])

print(bag)

def computeTF(wordDict,bow):
        tfDict = {}
        bowCount = len(bow)
        for word, count in wordDict.items():
                tfDict[word] = count / float(bowCount)
        return tfDict

tfBowA = computeTF(wordDictA, bowA)
tfBowB = computeTF(wordDictB, bowB)

def computeIDF(docList):
        import math
        idfDict = {}
        N = len(docList)
        #Count N of docs that contain word w
        idfDict = dict.fromkeys(docList[0].keys(),0)
        for doc in docList:
                for word, val in doc.items():
                        if val > 0:
                                idfDict[word] +=1
        for word, val in idfDict.items():
                idfDict[word] = math.log(N/ float(val))
        return idfDict

idfs = computeIDF([wordDictA, wordDictB])

def computeTFIDF(tfBow,idfs):
        tfidf = {}
        for word, val in tfBow.items():
                tfidf[word] = val * idfs[word]
        return tfidf

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

TF = pd.DataFrame([tfidfBowA, tfidfBowB])

print(TF)

The resulting table should look like this, with the common words (on, my, sat, the) all scoring 0:

         bed       cat       dog      face        my        on       sat       the   
0  0.000000  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000   
1  0.115525  0.000000  0.115525  0.000000  0.000000  0.000000  0.000000  0.000000 

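But the data frame I get looks like this, with the same score for every word except those that appear in only one of the documents (bed/dog, cat/face):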

         bed       cat       dog      face        my        on       sat       the   
0  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833  0.020833   
1  0.020833  0.000000  0.020833  0.000000  0.020833  0.020833  0.020833  0.020833 

If I print(idfs), I get:

{'my': 0.0, 'sat': 0.0, 'dog': 0.6931, 'cat': 0.6931, 'on': 0.0, 'the': 0.0, 'face': 0.6931, 'bed': 0.6931}

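Here, words contained in both documents have a value of 0. That value is then used to weight their importance, since they are common to all documents. Before calling the computeTFIDF function, the data looks like this: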

{'my': 0.1666, 'sat': 0.1666, 'dog': 0.0, 'cat': 0.1666, 'on': 0.1666, 'the': 0.1666, 'face': 0.1666, 'bed': 0.0}

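Since the function multiplies the two numbers, "my" (idf of 0) should come out as 0, and a word found in only one document, such as "dog" in docB (idf of 0.6931), should come out as 0.6931 * 0.1666 ≈ 0.1155, as in the example. Instead, I get 0.020833 for every word except those that are absent from the document. Apart from the items/iteritems difference between Python 2 and 3, is something else messing up my code?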

In the second-to-last part, just before converting to a DataFrame, change these two lines -

tfidfBowA = computeTF(tfBowA, idfs)
tfidfBowB = computeTF(tfBowB, idfs)

to -

tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

To compute the TF-IDF you have to call the function computeTFIDF(), not computeTF().
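
It may help to see why the original call produced 0.020833 rather than raising an error. computeTF only needs .items() on its first argument and len() on its second, so passing tfBowA and idfs quietly divides every TF value by len(idfs), which is 8 here. The sketch below reuses computeTF from the question; the way the two dictionaries are built is a shortcut assumed for illustration, not the question's code.

```python
def computeTF(wordDict, bow):
    # Copied from the question: divides each count by len(bow).
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bowCount)
    return tfDict

words = ["bed", "cat", "dog", "face", "my", "on", "sat", "the"]
docA = "the cat sat on my face".split(" ")

# TF values for docA as printed in the question: 1/6 for present words, 0 otherwise.
tfBowA = {w: (1 / 6 if w in docA else 0.0) for w in words}

# Only the length of idfs matters for the mistaken call, so the values are placeholders.
idfs = dict.fromkeys(words, 0.0)

wrong = computeTF(tfBowA, idfs)   # the mistaken call: (1/6) / 8 for every word in docA
print(round(wrong["cat"], 6))     # 0.020833
print(round(wrong["my"], 6))      # 0.020833
print(wrong["dog"])               # 0.0
```

The idf values are never multiplied in, which is why the common words are not zeroed out.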

Output

tfidfBowA
{'bed': 0.0,
 'cat': 0.11552453009332421,
 'dog': 0.0,
 'face': 0.11552453009332421,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}

tfidfBowB
{'bed': 0.11552453009332421,
 'cat': 0.0,
 'dog': 0.11552453009332421,
 'face': 0.0,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0,
 'the': 0.0}
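
As a cross-check, here is a compact sketch that recomputes the same numbers from scratch using plain dictionaries. The helper names term_freq and inverse_doc_freq are made up for this example and are not from the video or the question; the formulas are the same ones used above (count divided by document length, and the natural log of N over document frequency).

```python
import math

docA = "the cat sat on my face".split(" ")
docB = "the dog sat on my bed".split(" ")
vocab = sorted(set(docA) | set(docB))

def term_freq(tokens):
    # Hypothetical helper: occurrences of each vocabulary word / document length.
    return {w: tokens.count(w) / len(tokens) for w in vocab}

def inverse_doc_freq(docs):
    # Hypothetical helper: ln(N / number of documents containing the word).
    # Every vocabulary word occurs in at least one document, so no division by zero.
    n = len(docs)
    return {w: math.log(n / sum(1 for d in docs if w in d)) for w in vocab}

idfs = inverse_doc_freq([docA, docB])
tfidfA = {w: tf * idfs[w] for w, tf in term_freq(docA).items()}
tfidfB = {w: tf * idfs[w] for w, tf in term_freq(docB).items()}

print(round(tfidfA["cat"], 6))  # 0.115525
print(tfidfA["the"])            # 0.0
print(round(tfidfB["dog"], 6))  # 0.115525
```

With only two documents, any word shared by both gets an idf of ln(2/2) = 0, so its TF-IDF is 0 regardless of how often it occurs.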

Hope this helps!