Counting how many documents a word appears in
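I am trying to implement a TF-IDF vectorizer without sklearn. For every word in the corpus (a list of strings), I want to count the number of documents in which that word appears.
Example: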
corpus = [
'this is the first document',
'this document is the second document',
'and this is the third one',
'is this the first document',
]
Expected output: {this : 4, is : 4}
and so on for every word.
My code:
def docs(corpus):
    doc_count = dict()
    for line in corpus:
        for word in line.split():
            if word in line:
                doc_count[word] +=1
            else:
                doc_count[word] = 1
    print(counts)
docs(corpus)
The error I get:
KeyError Traceback (most recent call last)
<ipython-input-70-6bf2b69708bc> in <module>
9 print(counts)
10
---> 11 docs(corpus)
<ipython-input-70-6bf2b69708bc> in docs(corpus)
4 for word in line.split():
5 if word in line.split():
----> 6 doc_count[word] +=1
7 else:
8 doc_count[word] = 1
KeyError: 'this'
If I am not iterating correctly, please let me know what I am missing. Thanks!
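corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def docs(corpus):
    doc_count = dict()
    for line in corpus:
        # set() keeps one copy of each word, so a word repeated within
        # a document is counted once for that document
        for word in set(line.split()):
            # The mistake was here: `word in line` is always true, so the
            # += 1 branch ran before the key existed (hence the KeyError).
            # Test membership in the dictionary instead.
            if word in doc_count:
                doc_count[word] += 1
            else:
                doc_count[word] = 1
    return doc_count

ans = docs(corpus)
print(ans)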
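Because the goal is the number of documents containing each word, a word repeated inside one document (such as 'document' in the second string) should count once for that document. The same counting can be sketched more compactly with collections.Counter from the standard library. The names document_frequency and df are mine, not from the question, and this is only one possible way to write it:

```python
from collections import Counter

def document_frequency(corpus):
    """Map each word to the number of documents (strings) that contain it."""
    df = Counter()
    for line in corpus:
        # set() drops repeats within one document, so each word adds at most 1
        df.update(set(line.split()))
    return df

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

df = document_frequency(corpus)
print(df['this'], df['is'], df['document'])  # 4 4 3
```

A Counter returns 0 for a missing key instead of raising KeyError, so no explicit membership test is needed.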