Producing a multi-level dictionary from words and part-of-speech tags
Given some Penn Treebank tagged text in this format:
"David/NNP Short/NNP will/MD chair/VB the/DT meeting/NN ./. The/DT boy/NN sits/VBZ on/IN the/DT chair/NN ./."
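I would like to generate a multi-level dictionary that uses the word as the key and counts how often it appears with each POS tag, so that we get something like ['Chair, VB : 1, NN : 1', 'The, DT : 3'] etc.
I figured I could use regular expressions to extract the word and the corresponding POS: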
r'[A+Za+z]+/' and r'/[A+Z]+'
but I can't figure out how to put them together to create an entry for each word and its corresponding POS occurrences.
Any ideas?
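You don't need regular expressions in this case.
What you can do is split the text on whitespace, then split each token on the slash, collecting the results into a defaultdict of defaultdict of int: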
In [1]: import re

In [2]: from collections import defaultdict

In [3]: s = "David/NNP Short/NNP will/MD chair/VB the/DT meeting/NN ./. The/DT boy/NN sits/VBZ on/IN the/DT chair/NN ./."

In [4]: d = defaultdict(lambda: defaultdict(int))

In [5]: for item in s.split():
   ...:     word, tag = item.split("/")
   ...:     word = word.lower()
   ...:     d[word][tag] += 1
   ...:
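Now d would be (the output below is from Python 2, where print(word, tag, count) displays a tuple and the iteration order of a dict is arbitrary):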
In [6]: for word, word_data in d.items():
   ...:     for tag, count in word_data.items():
   ...:         print(word, tag, count)
   ...:
('boy', 'NN', 1)
('short', 'NNP', 1)
('on', 'IN', 1)
('david', 'NNP', 1)
('will', 'MD', 1)
('sits', 'VBZ', 1)
('chair', 'VB', 1)
('chair', 'NN', 1)
('.', '.', 2)
('meeting', 'NN', 1)
('the', 'DT', 3)
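If you want this as a reusable piece outside the interactive session, here is a minimal sketch of the same split-and-count idea for Python 3. The function name count_pos_tags is made up for illustration. It also assumes every token contains at least one slash (a token without one would raise ValueError), and it splits on the last slash only, so a word that itself contains "/" stays intact.

```python
from collections import defaultdict


def count_pos_tags(tagged_text):
    """Return {word: {tag: count}} for whitespace-separated word/TAG tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for token in tagged_text.split():
        # Split on the LAST slash only, so a word such as "1/2" keeps its slash.
        word, tag = token.rsplit("/", 1)
        counts[word.lower()][tag] += 1
    # Convert to plain dicts so that looking up a missing word raises KeyError
    # instead of silently creating an empty entry.
    return {word: dict(tags) for word, tags in counts.items()}


text = ("David/NNP Short/NNP will/MD chair/VB the/DT meeting/NN ./. "
        "The/DT boy/NN sits/VBZ on/IN the/DT chair/NN ./.")
counts = count_pos_tags(text)
print(counts["chair"])  # {'VB': 1, 'NN': 1}
print(counts["the"])    # {'DT': 3}
```

Returning plain dicts is a design choice rather than a requirement: a defaultdict adds a new empty entry whenever you read a key that is not there, which can be surprising once the counting is finished.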