Matching PoS tags with specific text with `textacy.extract.pos_regex_matches(...)`

Question

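I'm using textacy's pos_regex_matches method to find certain chunks of text in sentences.

For example, suppose I have the text: Huey, Dewey, and Louie are triplet cartoon characters., and I want to detect that Huey, Dewey, and Louie is an enumeration.

To do so, I use the following code (on textacy 0.3.4, the version available at the time of writing):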

import textacy

sentence = 'Huey, Dewey, and Louie are triplet cartoon characters.'
pattern = r'<PROPN>+ (<PUNCT|CCONJ> <PUNCT|CCONJ>? <PROPN>+)*'
doc = textacy.Doc(sentence, lang='en')
lists = textacy.extract.pos_regex_matches(doc, pattern)
for match in lists:  # avoid shadowing the built-in `list`
    print(match.text)

This prints:

Huey, Dewey, and Louie

However, if I have something like the following:

sentence = 'Donald Duck - Disney'

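then the - (dash) is recognized as <PUNCT> and the whole sentence is recognized as a list, which it is not.

Is there a way to specify that only , and ; are valid <PUNCT> for lists?

I've looked for a reference on this regex language for matching PoS tags, but with no luck. Can anybody help? Thanks in advance!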

Answer 1

The short answer: it is not possible; see this official page.

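However, the merge request contains the code of a modified version of what is described on that page, so the functionality can be recreated, although it performs worse than using spaCy's Matcher (see code and example, though I have no idea how to use a Matcher).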
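To make the idea concrete, here is a minimal sketch of matching a PoS pattern while restricting which punctuation texts count as list separators. It is not the code from the merge request and it does not use textacy or spaCy: the (text, PoS) pairs are hard-coded assumptions about what a tagger might return, and `find_enumerations` is a hypothetical helper written only for illustration.

```python
import re

def find_enumerations(tagged, separators=(",", ";")):
    """Return runs of proper nouns joined by allowed separators.

    `tagged` is a list of (text, pos) pairs. A PUNCT token counts as a
    list separator only if its text is in `separators`.
    """
    def code(text, pos):
        # Encode every token as one character so that positions in the
        # encoded string line up with token indices.
        if pos == "PROPN":
            return "N"
        if pos == "CCONJ":
            return "C"
        if pos == "PUNCT" and text in separators:
            return "S"
        return "x"  # anything else, including a dash, breaks the match

    encoded = "".join(code(text, pos) for text, pos in tagged)
    # Same shape as the pattern in the question:
    # <PROPN>+ (<PUNCT|CCONJ> <PUNCT|CCONJ>? <PROPN>+)*
    pattern = re.compile(r"N+(?:[SC][SC]?N+)*")
    return [
        [text for text, _ in tagged[m.start():m.end()]]
        for m in pattern.finditer(encoded)
    ]

# Assumed tagger output for the two sentences in the question.
huey = [("Huey", "PROPN"), (",", "PUNCT"), ("Dewey", "PROPN"),
        (",", "PUNCT"), ("and", "CCONJ"), ("Louie", "PROPN"),
        ("are", "AUX"), ("triplet", "NOUN"), ("cartoon", "NOUN"),
        ("characters", "NOUN"), (".", "PUNCT")]
donald = [("Donald", "PROPN"), ("Duck", "PROPN"),
          ("-", "PUNCT"), ("Disney", "PROPN")]

huey_matches = find_enumerations(huey)
donald_matches = find_enumerations(donald)
print(huey_matches)
print(donald_matches)
```

With these inputs the first sentence yields a single match covering Huey, Dewey, and Louie, while the second yields two separate matches (Donald Duck and Disney), because the dash is no longer accepted as a separator.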

Re-implementation for my problem

If you want to go down this route, you have to replace the line:

words.extend(map(lambda x: re.sub(r'\W', '', x), keyword_map[w]))

with the following:

words.extend(keyword_map[w])

otherwise every symbol (such as , and ; in my case) will be stripped.
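The effect of that one-line change can be checked in isolation. In the sketch below, `keywords` is a hypothetical stand-in for `keyword_map[w]`; the rest of the modified code is not reproduced here.

```python
import re

# Hypothetical stand-in for keyword_map[w] in the modified code.
keywords = [",", ";", "and"]

# Original line: removes every non-word character, so "," and ";" become "".
stripped = list(map(lambda x: re.sub(r'\W', '', x), keywords))

# Replacement line: the keywords are kept as they are.
kept = list(keywords)

print(stripped)  # ['', '', 'and']
print(kept)      # [',', ';', 'and']
```

Since `\W` matches any character that is not a letter, digit, or underscore, punctuation-only keywords are reduced to empty strings by the original line.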
