How can I count the occurrences of each word in a document using a dictionary comprehension?

Question

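I have a list of complete texts in Python. It is like a set of words from each document. So for each document I have a list, and then a list of all the documents.

Each list contains only unique words. My aim is to count the occurrences of each word across all the documents. I can do this successfully with the following code: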

term_appearance = {}  # word -> number of documents that contain it
for x in texts_list:
    for l in x:
        if l in term_appearance:
            term_appearance[l] += 1
        else:
            term_appearance[l] = 1

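But I want to do the same thing with a dictionary comprehension. This is my first attempt at writing one, and with the help of earlier posts on Stack Overflow I have been able to write the following: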

from collections import defaultdict
term_appearance = defaultdict(int)

{{term_appearance[l] : term_appearance[l] + 1 if l else term_appearance[l] : 1 for l in x} for x in texts_list}

Previous post for reference:

Simple syntax error in Python if else dict comprehension

Following the suggestion in the post above, I also tried the following code:

{{l : term_appearance[l] + 1 if l else 1 for l in x} for x in texts_list}

The code above printed empty lists and then raised the following traceback:

[]
[]
[]
[]
Traceback (most recent call last):
  File "term_count_fltr.py", line 28, in <module>
    {{l : term_appearance[l] + 1 if l else 1 for l in x} for x in texts_list}
  File "term_count_fltr.py", line 28, in <setcomp>
    {{l : term_appearance[l] + 1 if l else 1 for l in x} for x in texts_list}
TypeError: unhashable type: 'dict'

Any help in improving my current understanding would be much appreciated.

Looking at the error above, I also tried:

[{l : term_appearance[l] + 1 if l else 1 for l in x} for x in texts_list]

This runs without any errors, but the output is just empty lists.

Answer 1

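The reason you are getting the unhashable type error is that you cannot use a dictionary as a key of another dictionary in Python (or as an element of a set, which is what the outer braces create here), because dictionaries are mutable containers.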

See: why dict objects are unhashable in python?
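
A minimal sketch of the hashability rule, using made-up sample data (the names `word_counts`, `frozen` and `snapshots` are illustrative and do not come from the question). It shows that a dict is rejected both as a dictionary key and as a set element, while an immutable snapshot of its items is accepted:

```python
word_counts = {"apple": 1, "pear": 2}  # hypothetical sample data

# A dict cannot be used as a key of another dict ...
try:
    {word_counts: "value"}
    key_error = None
except TypeError as exc:
    key_error = str(exc)

# ... nor as an element of a set, which is the case behind the traceback.
try:
    {word_counts}
    set_error = None
except TypeError as exc:
    set_error = str(exc)

# An immutable snapshot of the items is hashable and can live in a set.
frozen = frozenset(word_counts.items())
snapshots = {frozen}
```

Both `key_error` and `set_error` end up holding the "unhashable type" message, whereas the `frozenset` is stored without complaint.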

Answer 2

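Dictionary comprehensions in Python 2.7+ do not work the way you think they do.

Like list comprehensions, they create a new dictionary, but you cannot use them to add keys to an already existing dictionary (which is what you are trying to do here).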
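
To illustrate, here is a hedged sketch with made-up data (the contents of `texts_list` and the names `fresh`, `vocabulary` and `counts` are assumptions for this example). The comprehension only reads `term_appearance`; with a `defaultdict(int)` each lookup merely inserts a default 0, so every computed value is 1 and nothing accumulates. A comprehension can still produce the counts if each value is computed independently, at the cost of rescanning every document for every word:

```python
from collections import defaultdict

texts_list = [["a", "b"], ["b", "c"], ["b"]]  # hypothetical sample documents
term_appearance = defaultdict(int)

# Builds one brand-new dict per document. term_appearance is only read,
# so each value is 0 + 1 and no running total is ever stored.
fresh = [{w: term_appearance[w] + 1 for w in doc} for doc in texts_list]

# Each value below is computed on its own, so no running total is needed.
# Because every document holds unique words, counting the documents that
# contain a word gives its total number of appearances.
vocabulary = {w for doc in texts_list for w in doc}
counts = {w: sum(w in doc for doc in texts_list) for w in vocabulary}
```

For anything beyond small inputs, a single pass that updates a counter is likely the better choice, since the comprehension above does work proportional to the vocabulary size times the number of documents.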

Answer 3

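As explained in the other answers, the issue is that a dictionary comprehension creates a new dictionary, so you cannot refer to that new dictionary until it has been created. You cannot use a dictionary comprehension for what you are doing.

Given that, what you are doing is trying to re-implement what collections.Counter already does. You can simply use Counter. Example -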

from collections import Counter
term_appearance = Counter()
for x in texts_list:
    term_appearance.update(x)

Demo -

>>> l = [[1,2,3],[2,3,1],[5,4,2],[1,1,3]]
>>> from collections import Counter
>>> term_appearance = Counter()
>>> for x in l:
...     term_appearance.update(x)
...
>>> term_appearance
Counter({1: 4, 2: 3, 3: 3, 4: 1, 5: 1})

If you really want to do this in some kind of comprehension, you could do:

from collections import Counter
term_appearance = Counter()
[term_appearance.update(x) for x in texts_list]

Demo -

>>> l = [[1,2,3],[2,3,1],[5,4,2],[1,1,3]]
>>> from collections import Counter
>>> term_appearance = Counter()
>>> [term_appearance.update(x) for x in l]
[None, None, None, None]
>>> term_appearance
Counter({1: 4, 2: 3, 3: 3, 4: 1, 5: 1})

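The output [None, None, None, None] comes from the list comprehension producing that list (since this was run interactively); if you run it in a script as python <script>, that output is simply discarded.

You can also use itertools.chain.from_iterable() to create a flattened iterable from texts_list, and then pass that to Counter. Example: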

from collections import Counter
from itertools import chain
term_appearance = Counter(chain.from_iterable(texts_list))

Demo -

>>> from collections import Counter
>>> from itertools import chain
>>> term_appearance = Counter(chain.from_iterable(l))
>>> term_appearance
Counter({1: 4, 2: 3, 3: 3, 4: 1, 5: 1})

Also, another issue in your original code -

{{term_appearance[l] : term_appearance[l] + 1 if l else term_appearance[l] : 1 for l in x} for x in texts_list}

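This is actually a set comprehension with a dictionary comprehension nested inside it.

That is why you are getting the error - TypeError: unhashable type: 'dict'. After first running the dictionary comprehension and creating a dict, Python tries to add that dict to the set. But dictionaries are not hashable, hence the error.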
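
A small sketch of the difference, using made-up data (`texts_list` here is a hypothetical sample, and `result` and `per_document` are illustrative names). The outer braces ask Python for a set, so each inner dict would have to be hashed; switching to square brackets only stores the dicts in a list, which is consistent with the list version in the question running without an error:

```python
texts_list = [["a", "b"], ["b", "c"]]  # hypothetical sample documents

# Outer braces without "key: value" form a set comprehension, and a set
# must hash each element, so adding the first inner dict already fails.
try:
    result = {{w: 1 for w in doc} for doc in texts_list}
except TypeError:
    result = None

# A list accepts unhashable elements, so this version succeeds, but it
# yields one separate dict per document rather than a combined count.
per_document = [{w: 1 for w in doc} for doc in texts_list]
```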

Answer 4

Please go through the answer by Anand S Kumar if you want to use collections.Counter, which is a great suggestion. However, there is another solution using collections.defaultdict that I think is worth mentioning:

from collections import defaultdict

# int as the default factory makes every missing key start at 0
text_appearances = defaultdict(int)

for x in texts_list:
    for l in x:
        text_appearances[l] += 1

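I have used this construct many times, and I think it is a clean and neat way to count. Especially if you need some validation inside the loops for whatever reason, it is an efficient way to update the counts directly without worrying about whether the key/word already exists in your dictionary (as in your first solution).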
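
As an illustration of the validation point, here is a minimal sketch. The sample data and the rule of skipping empty or non-alphabetic tokens are assumptions made up for this example and are not part of the question:

```python
from collections import defaultdict

# Hypothetical sample documents, including tokens that should be skipped.
texts = [["apple", "", "pear"], ["pear", "42"], ["apple", "pear"]]

text_appearances = defaultdict(int)  # missing keys start at 0

for words in texts:
    for word in words:
        # Example validation step: ignore empty and non-alphabetic tokens.
        if not word.isalpha():
            continue
        text_appearances[word] += 1
```

With this input only "apple" and "pear" are counted; the empty string and "42" never reach the increment.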

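Side note on variable naming: please do not use a lowercase l (lowercase L) as a variable name; it is hard to distinguish from 1 (the digit one). In your case, maybe you could name the variables words and word? And without the _list suffix, the code could read: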

for words in texts:
    for word in words:
        text_appearances[word] += 1

python

dictionary

list

python-2.7

dictionary-comprehension