Counting how many documents a word appears in

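I am trying to implement a TF-IDF vectorizer without sklearn. I want to count the number of documents (the corpus is a list of strings) in which a word appears, and do the same for every word in the corpus. Example: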

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

Expected output: {this: 4, is: 4}, and so on for every word

My code:

def docs(corpus):
    doc_count = dict()
    for line in corpus:
        for word in line.split():
            if word in line:
                doc_count[word] +=1
            else:
                doc_count[word] = 1
        print(counts)

docs(corpus)

The error I get:

KeyError                                  Traceback (most recent call last)
<ipython-input-70-6bf2b69708bc> in <module>
      9         print(counts)
     10 
---> 11 docs(corpus)

<ipython-input-70-6bf2b69708bc> in docs(corpus)
      4         for word in line.split():
      5             if word in line.split():
----> 6                 doc_count[word] +=1
      7             else:
      8                 doc_count[word] = 1

KeyError: 'this'

If I am not iterating correctly, please let me know what I am missing. Thanks!

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

def docs(corpus):
    doc_count = dict()
    for line in corpus:
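        # set() keeps each word once per line, so a word repeated within
        # one document still counts as a single document.
        for word in set(line.split()):
            # The question tested whether the word is in the line, which is
            # always true here, so `+= 1` ran on a key that did not exist
            # yet (the KeyError). Test membership in the dict instead.
            if word in doc_count:
                doc_count[word] += 1
            else:
                doc_count[word] = 1
    return doc_count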

ans=docs(corpus)
print(ans)
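
For a document count, a word repeated inside one document should still count only once for that document ('document' appears twice in the second string but is present in three documents). A minimal sketch of that idea using collections.Counter from the standard library; the function name document_frequency is a placeholder chosen for this example, not something from the original post:

```python
from collections import Counter

def document_frequency(corpus):
    """Map each word to the number of documents that contain it."""
    df = Counter()
    for line in corpus:
        # set() drops repeats within one document, so each document
        # adds at most 1 to a word's count.
        df.update(set(line.split()))
    return dict(df)

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

df = document_frequency(corpus)
print(df['this'], df['document'])  # 4 3
```

Counter treats a missing key as 0, so no explicit membership check is needed before incrementing.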