How to simplify a function that finds homographs?

Question

I wrote a function that finds homographs in a text.

A homograph is a word that shares the same written form as another word but has a different meaning.

To do this, I used NLTK's POS tagger (pos_tag).

A POS tagger processes a sequence of words and attaches a part-of-speech tag to each word.

For example: [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')].

Code (edited):

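from nltk import pos_tag, word_tokenize

def find_homographs(text):
    homographs_dict = {}
    if isinstance(text, str):
        text = word_tokenize(text)
    tagged_tokens = pos_tag(text)
    # Compare every (word, tag) pair with every other pair
    for tag1 in tagged_tokens:
        for tag2 in tagged_tokens:
            try:
                # Skip pairs already recorded in the opposite direction
                if homographs_dict[tag2] == tag1:
                    continue
            except KeyError:
                # Same word, different POS tag
                if tag1[0] == tag2[0] and tag1[1] != tag2[1]:
                    homographs_dict[tag1] = tag2
    return homographs_dict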

It works, but it takes too long because I use two nested for loops. Please tell me how to simplify it and make it faster.

Answer 1

Here is a suggestion (untested). The main idea is to build a dictionary while iterating over tagged_tokens, so that homographs are identified with non-nested loops:

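temp_dict = dict()
for tag in tagged_tokens:
    # setdefault returns the stored set, so add() updates it in place
    # (list.append returns None, so its result must not be assigned back)
    temp_dict.setdefault(tag[0], set()).add(tag[1])
# Build a new dict instead of deleting entries while iterating,
# keeping only the words that have more than one distinct tag
temp_dict = {word: tags for word, tags in temp_dict.items() if len(tags) > 1}
print(temp_dict)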

Answer 2

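This may seem counterintuitive, but you can easily collect all the POS tags for each word in the text, and then keep only the words that have more than one tag.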

from collections import defaultdict
alltags = defaultdict(set)
for word, tag in tagged_tokens:
    alltags[word].add(tag)
homographs = dict((w, tags) for w, tags in alltags.items() if len(tags) > 1)

Note the two-variable loop; it is much more convenient than writing tag1[0] and tag1[1]. You will have to look up defaultdict (and set) in the manual.
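As a rough illustration of those two features, here is a minimal sketch. The (word, tag) pairs are hand-written for the example and are not real tagger output; the names pairs, tags_by_word and plain are made up for illustration.

```python
from collections import defaultdict

# Hand-written (word, tag) pairs for illustration only
pairs = [("run", "VB"), ("run", "NN"), ("run", "VB"), ("fast", "RB")]

tags_by_word = defaultdict(set)
for word, tag in pairs:          # each pair is unpacked into two variables
    tags_by_word[word].add(tag)  # a missing key gets a new empty set first

# The same result with a plain dict and setdefault
plain = {}
for word, tag in pairs:
    plain.setdefault(word, set()).add(tag)
```

Because the values are sets, the repeated ("run", "VB") pair is stored only once, so the size of each set is the number of distinct tags seen for that word.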

Your output format cannot handle words with three or more POS tags, so the dictionary homographs has words as keys and sets of POS tags as values.

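There are two more things I would suggest: (1) convert all words to lowercase to catch more "homographs"; and (2) nltk.pos_tag() expects to be called on one sentence at a time, so you will get more accurate tags if you sent_tokenize() your text and then word_tokenize() and pos_tag() each sentence separately.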
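Both suggestions could be combined roughly as follows. This is only a sketch: the helper names collect_homographs and tag_by_sentence are hypothetical, the sample tags are hand-written rather than real tagger output, and tag_by_sentence assumes NLTK and its tokenizer and tagger data are installed.

```python
from collections import defaultdict


def collect_homographs(tagged_sentences):
    """Map each lowercased word to its set of POS tags, keeping only
    the words that received more than one distinct tag."""
    alltags = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            alltags[word.lower()].add(tag)
    return {w: tags for w, tags in alltags.items() if len(tags) > 1}


def tag_by_sentence(text):
    """Tag the text one sentence at a time (needs NLTK and its data)."""
    from nltk import pos_tag, sent_tokenize, word_tokenize
    return [pos_tag(word_tokenize(s)) for s in sent_tokenize(text)]


# Hand-written tags for illustration only (not real tagger output)
sample = [
    [("They", "PRP"), ("record", "VBP"), ("music", "NN")],
    [("The", "DT"), ("Record", "NN"), ("sold", "VBD"), ("well", "RB")],
]
result = collect_homographs(sample)
```

With real text, the call would be collect_homographs(tag_by_sentence(text)). Lowercasing means "record" and "Record" are counted as the same word, which is what lets the two different tags meet under one key.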

python-3.x

nltk

cycle