Search for phrases in text with POS tagging

I want to extract phrases that have "of" between two nouns. Here is my code:

import nltk

text = "I live in Kingdom of Spain"

tokens = nltk.word_tokenize(text)
tag = nltk.pos_tag(tokens, tagset='universal')
print(tag)

regexes = '''PHRASE:{<NOUN>of<NOUN>}'''

noun_phrase_regex = nltk.RegexpParser(regexes)
result = noun_phrase_regex.parse(tag)
result = list(result)
print(result)

Unfortunately, I don't get a Tree in my result, so my regex isn't working. I have tried:

{<NOUN>(of)<NOUN>}
{<NOUN>{of}<NOUN>}
{<NOUN>of<NOUN>}
{<NOUN><of><NOUN>}

But the result is the same.

Also, once I do get a result, how do I extract the Tree values from the list? At the moment I do it like this:

result = [element for element in result if type(element) != tuple]
result = [" ".join([word[0] for word in tup_phrase]) for tup_phrase in result]
print(result)

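It isn't possible to mix words and POS tags in an nltk parser grammar. You can still achieve what you want in other ways. For example, you can match every POS tag sequence that fits your requirement, then check whether each match contains 'of' or any variant of that word you want to accept (e.g. with some capital letters). That would work like this: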

import nltk


text = "I live in the Kingdom of Spain"
tokens = nltk.word_tokenize(text)
tag = nltk.pos_tag(tokens, tagset='universal')
print(tag)

# Chunk on tags only: the grammar matches POS tags, not literal words.
regexes = 'CHUNK: {<NOUN> <ADP> <NOUN>}'
noun_phrase_regex = nltk.RegexpParser(regexes)
tree = noun_phrase_regex.parse(tag)

# Collect every subtree that the grammar labelled CHUNK.
chunks = []
for subtree in tree.subtrees():
    if subtree.label() == 'CHUNK':
        chunks.append(subtree)

# Keep only the chunks whose middle word is 'of' (case-insensitive).
found = []
for chunk in chunks:
    leaves = chunk.leaves()
    if leaves[1][0].lower() == 'of':
        found.append(' '.join([word for word, _ in leaves]))

print(found)

This gives you:

>>> print(found)
['Kingdom of Spain']
>>> nltk.__version__
'3.7'
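
If all you need is the noun / "of" / noun pattern, you could also skip the chunk grammar and scan the tagged tokens directly with a three-token sliding window. This is only a rough sketch: find_of_phrases is a made-up helper, not an nltk function, and the tagged list is hard-coded on the assumption that it matches what nltk.pos_tag(tokens, tagset='universal') returns for this sentence (only the tags of the last three tokens are confirmed by the result above).

```python
# Hypothetical helper, not part of NLTK: it works on plain (word, tag) pairs.
def find_of_phrases(tagged, middle="of"):
    """Return 'X of Y' strings where X and Y are both tagged NOUN."""
    found = []
    # Walk over every window of three consecutive tokens.
    for (w1, t1), (w2, _), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        # Compare the middle word case-insensitively so 'Of' also matches.
        if t1 == "NOUN" and t3 == "NOUN" and w2.lower() == middle:
            found.append(f"{w1} {w2} {w3}")
    return found


# Assumed tagger output for "I live in the Kingdom of Spain".
tagged = [
    ("I", "PRON"), ("live", "VERB"), ("in", "ADP"), ("the", "DET"),
    ("Kingdom", "NOUN"), ("of", "ADP"), ("Spain", "NOUN"),
]

phrases = find_of_phrases(tagged)
print(phrases)
```

The trade-off is that this only handles fixed-length patterns. Once you need optional or repeated tags, the RegexpParser approach above is the better fit.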