Inserting term frequency values into a dictionary
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def computeTF(corpus):
    tfDict = {}
    for row in range(len(corpus)):
        number_of_words = dict(Counter(corpus[row].split()))
        for word, count in number_of_words.items():
            tfDict[word] = count / len(corpus[row].split())
    return tfDict

tfValue = computeTF(corpus)
print(tfValue)
I am computing the term frequency of every word in the corpus. After computing all the values I add them to tfDict and return it, but the values for each word are not returned correctly. What is going wrong?

Current value:
{'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2, 'second': 0.16666666666666666, 'and': 0.16666666666666666, 'third': 0.16666666666666666, 'one': 0.16666666666666666}
Expected value (one dict per document; e.g. in the second document 'document' appears twice out of six words, so its TF is 2/6 ≈ 0.33 while the other words are 1/6 ≈ 0.16):
{'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2},
{'this': 0.16, 'document': 0.33, 'is': 0.16, 'the': 0.16, 'second': 0.16},
{'and': 0.16, 'this': 0.16, 'is': 0.16, 'the': 0.16, 'third': 0.16, 'one': 0.16},
{'is': 0.2, 'this': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2}
As far as I can tell, this is the code you need:
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def computeTF(corpus):
    tfDict = {}
    for line in corpus:
        tfDict[line] = {}  # one inner dict per document, keyed by the document text
        line_words = line.split()
        for word in line_words:
            tfDict[line][word] = line_words.count(word) / len(line_words)
    return tfDict

print(computeTF(corpus))
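Because the outer dict is keyed by the document text, every document keeps its own inner dict of term frequencies and nothing is overwritten. Running it should print roughly the following (abbreviated, unrounded floats):

    {'this is the first document': {'this': 0.2, 'is': 0.2, 'the': 0.2, 'first': 0.2, 'document': 0.2},
     'this document is the second document': {'this': 0.16666666666666666, 'document': 0.3333333333333333, 'is': 0.16666666666666666, 'the': 0.16666666666666666, 'second': 0.16666666666666666},
     ...}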
As I understand it, because you build a single flat dictionary inside the loop, each document reassigns the TF values of any words it shares with an earlier document, so you need a separate counter (document index) for each document.
Try the code below:
def computeTF(corpus):
    tfDict = {}
    Document = 0  # index of the current document
    for row in range(len(corpus)):
        number_of_words = dict(Counter(corpus[row].split()))
        for word, count in number_of_words.items():
            # key each entry by word AND document so a later document
            # cannot overwrite the value from an earlier one
            tfDict["%s in Corpus-%s" % (word, Document)] = count / len(corpus[row].split())
        Document += 1
    return tfDict

tfValue = computeTF(corpus)
print(tfValue)
Output:
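(roughly; exact float formatting and key order may vary)

    {'this in Corpus-0': 0.2, 'is in Corpus-0': 0.2, 'the in Corpus-0': 0.2, 'first in Corpus-0': 0.2, 'document in Corpus-0': 0.2,
     'this in Corpus-1': 0.16666666666666666, 'document in Corpus-1': 0.3333333333333333, 'is in Corpus-1': 0.16666666666666666, 'the in Corpus-1': 0.16666666666666666, 'second in Corpus-1': 0.16666666666666666,
     'and in Corpus-2': 0.16666666666666666, 'this in Corpus-2': 0.16666666666666666, 'is in Corpus-2': 0.16666666666666666, 'the in Corpus-2': 0.16666666666666666, 'third in Corpus-2': 0.16666666666666666, 'one in Corpus-2': 0.16666666666666666,
     'is in Corpus-3': 0.2, 'this in Corpus-3': 0.2, 'the in Corpus-3': 0.2, 'first in Corpus-3': 0.2, 'document in Corpus-3': 0.2}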