Counting how many documents a word appears in

Question

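I am trying to implement a TF-IDF vectorizer without sklearn. I want to count the number of documents (strings in a list) in which a word appears, and do the same for every word in the corpus. Example: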

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

Expected output: {this: 4, is: 4}, and so on for each word.

My code:

def docs(corpus):
    doc_count = dict()
    for line in corpus:
        for word in line.split():
            if word in line:
                doc_count[word] +=1
            else:
                doc_count[word] = 1
        print(counts)

docs(corpus)

The error I get:

KeyError                                  Traceback (most recent call last)
<ipython-input-70-6bf2b69708bc> in <module>
      9         print(counts)
     10 
---> 11 docs(corpus)

<ipython-input-70-6bf2b69708bc> in docs(corpus)
      4         for word in line.split():
      5             if word in line.split():
----> 6                 doc_count[word] +=1
      7             else:
      8                 doc_count[word] = 1

KeyError: 'this'

If I am not iterating correctly, please let me know what I am missing. Thanks!

Answer 1

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

def docs(corpus):
    doc_count = dict()
    for line in corpus:
        for word in line.split():
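            # The mistake was here: `word in line` is always true for a word taken
            # from that line, so `+= 1` ran on a key that did not exist yet (KeyError).
            # Check whether the word is already in the dictionary instead.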
            if word in doc_count:
                doc_count[word] +=1
            else:
                doc_count[word] = 1
    return doc_count    

ans=docs(corpus)
print(ans)
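
One thing worth checking against the expected output: the loop above adds 1 for every occurrence of a word, so 'document' comes out as 4 because it appears twice in the second string, even though only 3 documents contain it. If the goal is the number of documents per word, a minimal sketch could deduplicate the words of each document first. It assumes plain whitespace tokenization is enough, and `document_frequency` is a name chosen here for illustration, not part of the original answer:

```python
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

def document_frequency(corpus):
    """Map each word to the number of documents that contain it."""
    doc_count = Counter()
    for line in corpus:
        # set() drops repeats inside one document, so each word
        # contributes at most 1 per document.
        doc_count.update(set(line.split()))
    return dict(doc_count)

df = document_frequency(corpus)
print(df['this'], df['is'], df['document'])  # prints: 4 4 3
```

`Counter` also removes the need for the `if word in doc_count` check, since missing keys start at 0.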

python

loops

nlp

tf-idf