Inserting term frequency value to dictionary

Question

corpus = [
 'this is the first document',
 'this document is the second document',
 'and this is the third one',
 'is this the first document',]

from collections import Counter

def computeTF(corpus):
    tfDict = {}
    for row in range(0, len(corpus)):
        number_of_words = dict(Counter(corpus[row].split()))
        for word, count in number_of_words.items():
            tfDict[word] = count / len(corpus[row].split())
    return tfDict

tfValue = computeTF(corpus)
print(tfValue)

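I am computing the term frequency of each word in the corpus. After computing all the values, I add them to tfDict and return it. However, the values are not returned correctly for each word. What exactly is going wrong?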

Current value: {'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2, 'second': 0.16666666666666666, 'and': 0.16666666666666666, 'third': 0.16666666666666666, 'one': 0.16666666666666666}

Expected value: {'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}, {'this': 0.16, 'document': 0.33, 'is': 0.16, 'the': 0.16, 'second': 0.16}, {'and': 0.16, 'this': 0.16, 'is': 0.16, 'the': 0.16, 'third': 0.16, 'one': 0.16}, {'is': 0.2, 'this': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}

Answer 1

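As far as I understand, you need this code. It builds a separate inner dictionary for each sentence, keyed by the sentence itself: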

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document', ]

def computeTF(corpus):
    tfDict = {}
    for line in corpus:
        tfDict[line] = {}
        line_words = line.split()
        for word in line_words:
            tfDict[line][word] = line_words.count(word)/len(line_words)
    return tfDict

print(computeTF(corpus))
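
If the goal is the exact shape shown in the question's expected value, that is, one dictionary per document in corpus order, a list may fit better than a dictionary keyed by the sentence text. The following is only a minimal sketch under that assumption; the name compute_tf_per_document is hypothetical and does not come from the original posts.

```python
from collections import Counter

def compute_tf_per_document(corpus):
    """Return one {word: term frequency} dict per document, in corpus order."""
    tf_list = []
    for line in corpus:
        words = line.split()
        counts = Counter(words)  # occurrences of each word in this document
        # A fresh dict per document, so values from other documents are never overwritten
        tf_list.append({word: count / len(words) for word, count in counts.items()})
    return tf_list

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

for tf in compute_tf_per_document(corpus):
    print(tf)
```

One difference from keying by sentence: if the same sentence appears twice in the corpus, a list keeps both entries, whereas a dictionary keyed by the sentence keeps only one.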

Answer 2

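As I understand it, because a single dictionary is filled inside the loop and keyed only by the word, the TF value of a word is reassigned every time that word appears in a later document. You therefore need to track the document as well, for example with a counter that identifies each document.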
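
To see the overwriting in isolation, here is a small illustrative sketch (the variable name tf is made up for the demonstration). It mimics what happens to the key 'document' across the first two sentences of the corpus from the question.

```python
tf = {}

# 'document' occurs once among the 5 words of the first sentence ...
tf['document'] = 1 / 5

# ... and twice among the 6 words of the second sentence.
# The key is the same, so the earlier 0.2 is replaced rather than kept alongside.
tf['document'] = 2 / 6

print(tf)  # {'document': 0.3333333333333333}
```

In the question's code the fourth sentence then sets the value back to 0.2, which is why the final dictionary shows only the values from whichever document contained each word last.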

Try the code below:

from collections import Counter

def computeTF(corpus):
    tfDict = {}
    # enumerate gives each document its own index for use in the key
    for Document, row in enumerate(corpus):
        words = row.split()
        number_of_words = Counter(words)
        for word, count in number_of_words.items():
            # The document index in the key keeps entries from different documents apart
            tfDict["%s in Corpus-%s" % (word, Document)] = count / len(words)
    return tfDict

tfValue = computeTF(corpus)
print(tfValue)

Output:


python

dictionary

tf-idf