Search for phrases in text with POS tagging
I want to extract phrases that have "of" between two nouns.
This is my code:
import nltk
text = "I live in Kingdom of Spain"
tokens = nltk.word_tokenize(text)
tag = nltk.pos_tag(tokens, tagset='universal')
print(tag)
regexes = '''PHRASE:{<NOUN>of<NOUN>}'''
noun_phrase_regex = nltk.RegexpParser(regexes)
result = noun_phrase_regex.parse(tag)
result = list(result)
print(result)
Unfortunately, my result does not contain a Tree, so my regex is not working properly.
I have tried:
{<NOUN>(of)<NOUN>}
{<NOUN>{of}<NOUN>}
{<NOUN>of<NOUN>}
{<NOUN><of><NOUN>}
But the result is the same.
Also, once I do get a result, how do I extract the Tree values from the list? Currently, I do it like this:
result = [element for element in result if type(element) != tuple]
result = [" ".join([word[0] for word in tup_phrase]) for tup_phrase in result]
print(result)
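It isn't possible to mix words and POS tags in an nltk parser grammar.
You can still achieve what you want another way. For example, you can match every POS tag sequence that fits your requirements, then check whether each match contains 'of', along with any variants of that word you want to accept (e.g. with some capital letters). That would work like this: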
import nltk

text = "I live in the Kingdom of Spain"
tokens = nltk.word_tokenize(text)
tag = nltk.pos_tag(tokens, tagset='universal')
print(tag)

# The grammar can only refer to POS tags, so match NOUN-ADP-NOUN first.
regexes = 'CHUNK: {<NOUN> <ADP> <NOUN>}'
noun_phrase_regex = nltk.RegexpParser(regexes)
tree = noun_phrase_regex.parse(tag)

# Collect every subtree that the grammar labelled as CHUNK.
chunks = []
for subtree in tree.subtrees():
    if subtree.label() == 'CHUNK':
        chunks.append(subtree)

# Keep only the chunks whose middle word is 'of'.
found = []
for chunk in chunks:
    leaves = chunk.leaves()
    if leaves[1][0] == 'of':
        found.append(' '.join([word for word, _ in leaves]))
print(found)
This gives you:
>>> print(found)
['Kingdom of Spain']
>>> nltk.__version__
'3.7'
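If you want to accept variants such as "Of", and to pull the Tree nodes out of the parse result without comparing against `tuple`, the same idea can be wrapped in a small function. This is only a sketch: `of_phrases` and its `middle_word` parameter are names I made up for illustration, and the tokens are tagged by hand (assuming the universal tagset output) so the snippet runs without downloading a tagger model.

```python
from nltk import RegexpParser, Tree

# Hand-tagged tokens (assumed universal-tagset output), so no tagger model is needed.
tagged = [
    ("I", "PRON"), ("live", "VERB"), ("in", "ADP"), ("the", "DET"),
    ("Kingdom", "NOUN"), ("Of", "ADP"), ("Spain", "NOUN"),
]


def of_phrases(tagged_tokens, middle_word="of"):
    """Return NOUN-ADP-NOUN phrases whose middle word equals middle_word, ignoring case."""
    parser = RegexpParser("CHUNK: {<NOUN><ADP><NOUN>}")
    tree = parser.parse(tagged_tokens)
    phrases = []
    for node in tree:
        # Unchunked tokens are plain (word, tag) tuples; chunks are Tree objects.
        if isinstance(node, Tree) and node.label() == "CHUNK":
            words = [word for word, _tag in node.leaves()]
            if words[1].lower() == middle_word.lower():
                phrases.append(" ".join(words))
    return phrases


print(of_phrases(tagged))
# A NOUN-ADP-NOUN match whose middle word is not "of" is filtered out.
print(of_phrases([("house", "NOUN"), ("in", "ADP"), ("Madrid", "NOUN")]))
```

Checking `isinstance(node, Tree)` is a bit more explicit than `type(element) != tuple`, and lowercasing both sides of the comparison takes care of the capitalization variants.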