Extracting full names with ne_chunks
Newbie here. I am trying to extract the full names of people and organizations using the following code.
    def get_continuous_chunks(text):
        chunked = ne_chunk(pos_tag(word_tokenize(text)))
        continuous_chunk = []
        current_chunk = []
        for i in chunked:
            if type(i) == Tree:
                current_chunk.append(' '.join([token for token, pos in i.leaves()]))
            if current_chunk:
                named_entity = ' '.join(current_chunk)
                if named_entity not in continuous_chunk:
                    continuous_chunk.append(named_entity)
                    current_chunk = []
            else:
                continue
            return continuous_chunk
>>> my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."
>>> get_continuous_chunks(my_sent)
['Toni']
As you can see, it returns only the first proper noun: not the full name, and none of the other proper nouns in the string.
What am I doing wrong?
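Here is some working code.
The best approach is to step through the code and put plenty of print statements in different places. You will see where I print the type() and str() values of each item you are iterating over. I find that seeing them listed helps me visualize and think through the loops and conditionals I am writing.
Also, oops, I inadvertently named all the variables "contiguous" instead of "continuous". Not sure why, though "contiguous" is probably more accurate.
Code: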
    from nltk import ne_chunk, pos_tag, word_tokenize
    from nltk.tree import Tree

    def get_continuous_chunks(text):
        chunked = ne_chunk(pos_tag(word_tokenize(text)))
        contiguous_chunk = []    # entity pieces collected since the last non-entity token
        contiguous_chunks = []   # finished entity strings
        for i in chunked:
            print(f"{type(i)}: {i}")
            if isinstance(i, Tree):
                current_chunk = ' '.join([token for token, pos in i.leaves()])
                # Apparently, Toni and Morrison are two separate items,
                # but "Random House" and "New York City" are single items.
                contiguous_chunk.append(current_chunk)
            else:
                # discontiguous, append to known contiguous chunks.
                if len(contiguous_chunk) > 0:
                    contiguous_chunks.append(' '.join(contiguous_chunk))
                    contiguous_chunk = []
        # Flush a trailing entity in case the text ends on one.
        if len(contiguous_chunk) > 0:
            contiguous_chunks.append(' '.join(contiguous_chunk))
        return contiguous_chunks

    my_sent = "Toni Morrison was the first black female editor in fiction at Random House in New York City."
    print()
    contig_chunks = get_continuous_chunks(my_sent)
    print(f"INPUT: My sentence: '{my_sent}'")
    print(f"ANSWER: My contiguous chunks: {contig_chunks}")
Execution:
(venv) [ttucker@zim stackoverflow]$ python contig.py
<class 'nltk.tree.Tree'>: (PERSON Toni/NNP)
<class 'nltk.tree.Tree'>: (PERSON Morrison/NNP)
<class 'tuple'>: ('was', 'VBD')
<class 'tuple'>: ('the', 'DT')
<class 'tuple'>: ('first', 'JJ')
<class 'tuple'>: ('black', 'JJ')
<class 'tuple'>: ('female', 'NN')
<class 'tuple'>: ('editor', 'NN')
<class 'tuple'>: ('in', 'IN')
<class 'tuple'>: ('fiction', 'NN')
<class 'tuple'>: ('at', 'IN')
<class 'nltk.tree.Tree'>: (ORGANIZATION Random/NNP House/NNP)
<class 'tuple'>: ('in', 'IN')
<class 'nltk.tree.Tree'>: (GPE New/NNP York/NNP City/NNP)
<class 'tuple'>: ('.', '.')
INPUT: My sentence: 'Toni Morrison was the first black female editor in fiction at Random House in New York City.'
ANSWER: My contiguous chunks: ['Toni Morrison', 'Random House', 'New York City']
I am also not entirely sure what exactly you are looking for, but from the description, this seems to be it.
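If you also want to keep the entity label, so that you can filter down to just people and organizations, the same grouping idea can be sketched with itertools.groupby. This is only a rough sketch under some assumptions: FakeChunk is a hypothetical stand-in I made up so the snippet runs without NLTK or its models, and merge_adjacent_entities is my own helper name, not part of NLTK. It relies only on items having label() and leaves(), which nltk.tree.Tree provides, so in principle the output of ne_chunk could be passed in directly.

```python
from itertools import groupby


class FakeChunk:
    """Hypothetical stand-in for nltk.tree.Tree: only label() and leaves()."""

    def __init__(self, label, leaves):
        self._label = label
        self._leaves = leaves

    def label(self):
        return self._label

    def leaves(self):
        return self._leaves


def merge_adjacent_entities(chunked):
    """Join each run of adjacent entity subtrees into one (label, text) pair."""
    entities = []
    # groupby splits the sequence into alternating runs of entity and
    # non-entity items; plain (token, pos) tuples have no label attribute.
    for is_entity, run in groupby(chunked, key=lambda item: hasattr(item, "label")):
        if not is_entity:
            continue
        run = list(run)
        text = " ".join(token for chunk in run for token, _pos in chunk.leaves())
        # Assumption: a merged run takes the label of its first piece.
        entities.append((run[0].label(), text))
    return entities


# Hand-built sample shaped like the printed ne_chunk output above.
sample = [
    FakeChunk("PERSON", [("Toni", "NNP")]),
    FakeChunk("PERSON", [("Morrison", "NNP")]),
    ("was", "VBD"),
    ("at", "IN"),
    FakeChunk("ORGANIZATION", [("Random", "NNP"), ("House", "NNP")]),
    ("in", "IN"),
    FakeChunk("GPE", [("New", "NNP"), ("York", "NNP"), ("City", "NNP")]),
]
result = merge_adjacent_entities(sample)
print(result)
```

Because groupby closes the last run on its own, an entity at the very end of the input is not lost. To keep only people and organizations, you could filter the result on the label, for example with a list comprehension that keeps pairs whose label is "PERSON" or "ORGANIZATION".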