Leanest way to compute term frequency without using the Counter class on a bag ADT

Question

I have some code that works well for computing term frequencies in a given list, using the Counter class imported from collections.

from collections import Counter

terms=['the', 'fox', 'the', 'quick', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

tf = Counter(terms)

print(tf)

The existing code works fine, but I'd like to know the leanest way to get the same result strictly with a bag/multiset ADT, without the help of Python's Counter class.

I've spent a few days experimenting with code and searching other forums, without success.

Answer 1

You can use a single dictionary comprehension:

terms=['the', 'fox', 'the', 'quick', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
new_terms = {term:terms.count(term) for term in terms}

Output (key order may vary with the Python version):

{'lazy': 1, 'over': 1, 'fox': 2, 'dog': 1, 'quick': 1, 'the': 3, 'jumps': 1}
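One thing to keep in mind: terms.count(term) rescans the whole list for every element, including repeated ones, so the comprehension does redundant work on long lists. A minimal sketch of a variant that counts each distinct term only once by iterating over set(terms); the name tf_by_set is only illustrative and not part of the original answer:

```python
terms = ['the', 'fox', 'the', 'quick', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# Iterate over the distinct terms so that each one is counted a single time.
# Each count() call still scans the full list, so this is not a one-pass solution.
tf_by_set = {term: terms.count(term) for term in set(terms)}

print(tf_by_set)
```

The resulting counts are the same as above; only the number of count() calls changes.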

Using the third-party multiset package:

import itertools
import multiset
final_data = [multiset.Multiset(list(b)) for a, b in itertools.groupby(sorted(terms))]

Output:

[Multiset({'dog': 1}), Multiset({'fox': 2}), Multiset({'jumps': 1}), Multiset({'lazy': 1}), Multiset({'over': 1}), Multiset({'quick': 1}), Multiset({'the': 3})]

Answer 2

You can use a plain dict and a loop, updating each count with get and a default value:

tf = {}
for t in terms:
    tf[t] = tf.get(t, 0) + 1
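Since the question asks for a bag ADT specifically, the same get-with-default idea could be wrapped in a small class. This is only a sketch under assumptions: the class name Bag and its methods add, count, and items are hypothetical choices, not taken from the question or from any library.

```python
class Bag:
    """Minimal bag (multiset) sketch backed by a plain dict. Hypothetical API."""

    def __init__(self, items=()):
        self._counts = {}
        for item in items:
            self.add(item)

    def add(self, item):
        # Same update as the loop above: default to 0, then increment.
        self._counts[item] = self._counts.get(item, 0) + 1

    def count(self, item):
        # Items that were never added have a multiplicity of 0.
        return self._counts.get(item, 0)

    def items(self):
        # (item, multiplicity) pairs.
        return self._counts.items()

    def __len__(self):
        # Total number of elements, duplicates included.
        return sum(self._counts.values())


terms = ['the', 'fox', 'the', 'quick', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
bag = Bag(terms)
print(dict(bag.items()))
```

Here bag.count('the') gives the frequency of a single term, and len(bag) gives the total number of terms stored.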

python

algorithm

multiset