Python dictionary: sum of all the values for each distinct first word in the key

Question

I have a dictionary whose keys are two-word tuples (bigrams) and whose values are counts. Example:

bigram_counts = {(u',', u'which'): 1, (u'of', u'the'): 2, ('<UNK>', u'by'): 2, (u'in', '<UNK>'): 1, ('<UNK>', u'charge'): 1, (u'``', '<UNK>'): 2, (u'The', u'and'): 1, ('<UNK>', u'reports'): 1, (u'an', '<UNK>'): 1, (u'election', u'was'): 1, ('<UNK>', u'primary'): 2, (u'that', '<UNK>'): 1, (u'that', u'the'): 1, (u'and', u'Fulton'): 1, ('<UNK>', u'to'): 1, (u'primary', u'election'): 1, (u'had', u'been'): 1, (u'primary', u'which'): 1, (u'The', '<UNK>'): 1, (u'the', u'election'): 2, (u'irregularities', u'took'): 1, (u',', u'``'): 1, ('<UNK>', u'that'): 1, ('<UNK>', u'of'): 2, (u'the', u'City'): 2, (u'in', u'which'): 1, (u'jury', '<UNK>'): 1, ('<UNK>', u'.'): 2, ('<UNK>', u'the'): 1, (u'of', u"Atlanta's"): 1, ('<UNK>', u'jury'): 1, (u'had', '<UNK>'): 1, (u'election', '<UNK>'): 1, (u'Fulton', u'County'): 1, ('<UNK>', u'``'): 2, (u'of', '<UNK>'): 1, ('<UNK>', u'said'): 2, (u'place', u'.'): 1, ('<UNK>', u'and'): 1, (u'election', u','): 1, (u"Atlanta's", '<UNK>'): 1, (u'which', u'the'): 1, (u'been', '<UNK>'): 1, (u'charge', u'of'): 1, (u'County', '<UNK>'): 1, (u'by', u'Fulton'): 1, (u'reports', u'of'): 1, (u'manner', u'in'): 1, ('<UNK>', u'an'): 1, (u"''", u'in'): 1, (u'the', '<UNK>'): 2, (u'said', '<UNK>'): 1, (u'Fulton', '<UNK>'): 1, (u'The', u'jury'): 1, (u'Atlanta', u"''"): 1, (u'``', u'irregularities'): 1, (u'in', u'the'): 1, (u'took', u'place'): 1, (u'for', u'the'): 1, (u'irregularities', u"''"): 1, ('<S>', u'The'): 3, (u"''", u'that'): 1, (u'City', '<UNK>'): 1, (u'which', u'was'): 1, (u"''", u'for'): 1, (u'was', '<UNK>'): 2, (u'jury', u'had'): 1, (u'said', u'in'): 1, (u'by', '<UNK>'): 1, ('<UNK>', u"''"): 1, ('<UNK>', u'irregularities'): 1, (u'to', '<UNK>'): 1, (u'.', '</S>'): 3, (u'of', u'Atlanta'): 1, ('<UNK>', u','): 1, (u'City', u'of'): 1, (u'and', '<UNK>'): 1, (u'which', u'had'): 1, (u'the', u'manner'): 1, ('<UNK>', '<UNK>'): 12}

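I want to return a new dictionary that holds, for each distinct first word in the keys, the sum of all the values whose key starts with that word; the second word can be anything. Example: in bigram_counts above there are 4 entries whose key has u'of' as its first word, and their values sum to 5.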

I also have another dictionary containing all the distinct words, to help with the computation. Example:

unigram_counts = {u'and': 2, u'City': 2, u"Atlanta's": 1, u'primary': 2, u'an': 1, u"''": 3, u'election': 3, u'in': 3, '<UNK>': 35, u'said': 2, u'for': 1, u'had': 2, u',': 2, u'been': 1, u'.': 3, u'to': 1, u'charge': 1, u'which': 3, u'Atlanta': 1, u'was': 2, u'``': 3, u'jury': 2, u'that': 2, '<S>': 4, u'took': 1, u'The': 3, u'by': 2, u'Fulton': 2, u'of': 5, u'reports': 1, u'irregularities': 2, u'County': 1, u'place': 1, u'the': 7, '</S>': 1, u'manner': 1}

In fact, unigram_counts already holds the totals I want. However, I need to compute the sums from bigram_counts myself and match them against the values in unigram_counts.

Thanks

Answer 1

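The way you have chosen to store the data does not look like the best fit for this task. You can certainly make it work, but a different storage layout might be more convenient. If you really have to do it this way, though, you should try something like this: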

unigram_counts = {}
for e in bigram_counts:
    # e is the (first, second) tuple; the count is stored under the full key e
    if e[0] in unigram_counts:
        unigram_counts[e[0]] += bigram_counts[e]
    else:
        unigram_counts[e[0]] = bigram_counts[e]

(Untested, but that's the idea.)
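As for what "a different storage layout" could look like: one option might be a nested dictionary that maps each first word to a dictionary of second words and counts, so the per-first-word total becomes a plain sum over the inner values. This is only a minimal sketch; the names sample_bigrams, nested_counts and first_word_totals are made up for illustration, and the data is a small subset of the question's bigram_counts.

```python
from collections import defaultdict

# Small subset of the question's data, used only for illustration.
sample_bigrams = {
    ('of', 'the'): 2,
    ('of', 'Atlanta'): 1,
    ('in', 'the'): 1,
}

# Hypothetical nested layout: first word -> {second word -> count}.
nested_counts = defaultdict(dict)
for (first, second), count in sample_bigrams.items():
    nested_counts[first][second] = count

# With this layout, the total for a first word is the sum of its inner values.
first_word_totals = {
    first: sum(seconds.values())
    for first, seconds in nested_counts.items()
}
print(first_word_totals)
```

Whether this is worth it depends on how the bigrams are used elsewhere; if the flat tuple-keyed dictionary is needed anyway, the loop above is enough.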

Answer 2

@Baruchel is right that this is probably the wrong structure for the job, but anyway:

import collections

unigram_counts = collections.defaultdict(int)
# items() works on Python 2 and 3; iteritems() is Python 2 only
for (first, _), val in bigram_counts.items():
    unigram_counts[first] += val

Seems to work.
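Since the question also asks to match the sums against unigram_counts, the comparison step could be sketched roughly as follows. The helper names sum_by_first_word and find_mismatches are hypothetical, and the data is a small subset of the question's dictionaries, so the mismatch reported for 'the' is an artifact of the subset rather than a statement about the full data.

```python
from collections import defaultdict

# Small subset of the question's bigram_counts, for illustration only.
bigram_counts = {
    ('of', 'the'): 2,
    ('of', "Atlanta's"): 1,
    ('of', '<UNK>'): 1,
    ('of', 'Atlanta'): 1,
    ('the', 'election'): 2,
    ('the', 'City'): 2,
}

# Matching entries from the question's unigram_counts.
unigram_counts = {'of': 5, 'the': 7}


def sum_by_first_word(bigrams):
    """Sum the counts of all bigrams that share the same first word."""
    totals = defaultdict(int)
    for (first, _second), count in bigrams.items():
        totals[first] += count
    return dict(totals)


def find_mismatches(first_word_totals, unigrams):
    """Return {word: (bigram_total, unigram_count)} where the two differ."""
    return {
        word: (total, unigrams.get(word))
        for word, total in first_word_totals.items()
        if unigrams.get(word) != total
    }


totals = sum_by_first_word(bigram_counts)
mismatches = find_mismatches(totals, unigram_counts)
print(totals)
print(mismatches)
```

An empty mismatches dictionary would mean every first-word total agrees with its unigram count; words that never appear in first position are not checked by this sketch.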


Tags: python, text-processing, dictionary, nlp, python-2.7