Producing a multi-level dictionary from words and part-of-speech tags

Given some Penn Treebank tagged text in this format:

"David/NNP Short/NNP will/MD chair/VB the/DT meeting/NN ./. The/DT boy/NN sits/VBZ on/IN the/DT chair/NN ./."

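I would like to produce a multi-level dictionary that has the word as its key and counts how often the word appears with each POS tag, so we have ['Chair, VB : 1, NN : 1', 'The, DT : 3', ...] and so on.

I figured I could use regular expressions to extract the word and the corresponding POS: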

r'[A+Za+z]+/' and r'/[A+Z]+'

But I can't figure out how to put the two together so that each word gets an entry with the counts of its corresponding POS tags.

Any ideas?

You don't have to use regular expressions in this case.
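For comparison, if a regex is still preferred, one pattern with two capture groups keeps each word paired with its tag, which is the step the two separate patterns in the question cannot do. This is only a minimal sketch: the pattern and the names TOKEN_RE and counts are illustrative assumptions, and it assumes every token has the form word/TAG.

```python
import re
from collections import defaultdict

# Assumption: tokens are whitespace-separated and shaped like word/TAG.
# The first group is greedy, so the split happens at the LAST slash of a token.
TOKEN_RE = re.compile(r"(\S+)/(\S+)")

s = ("David/NNP Short/NNP will/MD chair/VB the/DT meeting/NN ./. "
     "The/DT boy/NN sits/VBZ on/IN the/DT chair/NN ./.")

counts = defaultdict(lambda: defaultdict(int))  # word -> tag -> count
for word, tag in TOKEN_RE.findall(s):
    counts[word.lower()][tag] += 1

print(dict(counts["chair"]))  # {'VB': 1, 'NN': 1}
```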

What you can do is split the string on whitespace, then split each token on the slash and collect the results in a defaultdict of defaultdict of int:

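In [1]: from collections import defaultdict

In [2]: s = "David/NNP Short/NNP will/MD chair/VB the/DT meeting/NN ./. The/DT boy/NN sits/VBZ on/IN the/DT chair/NN ./."

In [3]: d = defaultdict(lambda: defaultdict(int))  # word -> tag -> count

In [4]: for item in s.split():
   ...:     # split on the last slash only, so a word containing "/" stays intact
   ...:     word, tag = item.rsplit("/", 1)
   ...:     word = word.lower()  # count "The" and "the" together
   ...:     d[word][tag] += 1

Now d contains the following (the output below is from Python 2, where this print call displays a tuple; Python 3 prints the three values separated by spaces):

In [5]: for word, word_data in d.items():
   ...:     for tag, count in word_data.items():
   ...:         print(word, tag, count)
   ...: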
('boy', 'NN', 1)
('short', 'NNP', 1)
('on', 'IN', 1)
('david', 'NNP', 1)
('will', 'MD', 1)
('sits', 'VBZ', 1)
('chair', 'VB', 1)
('chair', 'NN', 1)
('.', '.', 2)
('meeting', 'NN', 1)
('the', 'DT', 3)
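
The same idea can be wrapped in a reusable function. The following is a minimal sketch rather than a definitive implementation: the function name count_pos_tags and its lowercase parameter are hypothetical, and skipping tokens that have no slash is an assumption about how malformed input should be handled. It returns plain dicts and prints entries in roughly the shape the question asks for.

```python
from collections import defaultdict


def count_pos_tags(text, lowercase=True):
    """Return {word: {tag: count}} for whitespace-separated word/TAG tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for token in text.split():
        # rpartition splits at the last slash, so "1/2/CD" -> ("1/2", "/", "CD")
        word, sep, tag = token.rpartition("/")
        if not sep:
            continue  # assumption: ignore tokens that carry no tag
        if lowercase:
            word = word.lower()
        counts[word][tag] += 1
    # Convert to plain dicts so later lookups do not create empty entries
    return {word: dict(tags) for word, tags in counts.items()}


s = ("David/NNP Short/NNP will/MD chair/VB the/DT meeting/NN ./. "
     "The/DT boy/NN sits/VBZ on/IN the/DT chair/NN ./.")

result = count_pos_tags(s)
for word, tags in result.items():
    # One line per word, e.g. "chair, VB : 1, NN : 1"
    print(word + ", " + ", ".join(f"{tag} : {n}" for tag, n in tags.items()))
```

Passing lowercase=False keeps "The" and "the" as separate keys, which may matter if capitalisation carries information such as sentence position.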