Search for phrases in text with POS tagging

Question

I want to extract phrases in which "of" sits between two nouns. This is my code:

import nltk

text = "I live in Kingdom of Spain"

tokens = nltk.word_tokenize(text)
tag = nltk.pos_tag(tokens, tagset='universal')
print(tag)

regexes = '''PHRASE:{<NOUN>of<NOUN>}'''

noun_phrase_regex = nltk.RegexpParser(regexes)
result = noun_phrase_regex.parse(tag)
result = list(result)
print(result)

Unfortunately, my result contains no Tree, so my regex isn't working. I have tried:

{<NOUN>(of)<NOUN>}
{<NOUN>{of}<NOUN>}
{<NOUN>of<NOUN>}
{<NOUN><of><NOUN>}

But the result is the same.

Also, once I do get a result, how do I extract the Tree values from the list? Currently, I do it like this:

result = [element for element in result if type(element) != tuple]
result = [" ".join([word[0] for word in tup_phrase]) for tup_phrase in result]
print(result)

Answer 1

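It isn't possible to mix words and POS tags in an NLTK RegexpParser grammar: its tag patterns match only the POS tags, not the words themselves. You can still achieve what you want another way. For example, you can match every tag sequence that fits your requirement, then check whether each match contains 'of', or any variant of that word you want to accept (e.g. with some uppercase letters). That would work like this: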

import nltk


text = "I live in the Kingdom of Spain"
tokens = nltk.word_tokenize(text)
tag = nltk.pos_tag(tokens, tagset='universal')
print(tag)

regexes = 'CHUNK: {<NOUN> <ADP> <NOUN>}'
noun_phrase_regex = nltk.RegexpParser(regexes)
# Parse once; the parser returns a Tree whose CHUNK subtrees are the matches
tree = noun_phrase_regex.parse(tag)
chunks = []
for subtree in tree.subtrees():
    if subtree.label() == 'CHUNK':
        chunks.append(subtree)

# Keep only the chunks whose middle token is the word 'of'
found = []
for chunk in chunks:
    leaves = chunk.leaves()  # list of (word, tag) tuples
    if leaves[1][0] == 'of':
        found.append(' '.join([word for word, _ in leaves]))

print(found)

This gives you:

>>> print(found)
['Kingdom of Spain']
>>> nltk.__version__
'3.7'
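
To accept capitalized variants such as 'Of', and to pick the Tree elements out of the parse result without comparing against tuple, the same idea could be wrapped in a small helper. This is only a sketch: the function name find_of_phrases and its word parameter are made up for illustration, and the input is tagged by hand (universal tagset assumed) so that it runs without downloading the tokenizer or tagger data.

```python
import nltk


def find_of_phrases(tagged, word="of"):
    """Return 'NOUN <word> NOUN' phrases from (token, universal-tag) pairs.

    Hypothetical helper, not part of NLTK.
    """
    parser = nltk.RegexpParser("CHUNK: {<NOUN><ADP><NOUN>}")
    tree = parser.parse(tagged)
    phrases = []
    for node in tree:
        # Matched chunks are nltk.Tree objects; unmatched tokens stay plain tuples.
        if isinstance(node, nltk.Tree) and node.label() == "CHUNK":
            leaves = node.leaves()
            # Compare the middle token case-insensitively.
            if leaves[1][0].lower() == word:
                phrases.append(" ".join(token for token, _ in leaves))
    return phrases


# Hand-tagged tokens (an assumption standing in for nltk.pos_tag output).
tagged = [
    ("I", "PRON"), ("live", "VERB"), ("in", "ADP"), ("the", "DET"),
    ("Kingdom", "NOUN"), ("Of", "ADP"), ("Spain", "NOUN"),
]
print(find_of_phrases(tagged))
```

Checking isinstance(node, nltk.Tree) states the intent more directly than type(element) != tuple, and it keeps working if the list ever holds something other than trees and tuples.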

python

nltk