NLTK Named Entity recognition for a column in a dataset
Thanks to the code from "alvas" here, Named Entity Recognition with Regular Expression: NLTK, as an example:
from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
    # flush a trailing entity in case the text ends on a named-entity chunk
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk
txt = 'The new GOP era in Washington got off to a messy start Tuesday as House Republicans, under pressure from President-elect Donald Trump.'
print(get_continuous_chunks(txt))
The output is:
['GOP', 'Washington', 'House Republicans', 'Donald Trump']
I replaced that text with txt = df['content'][38] from my dataset, and I got this result:
['Ina', 'Tori K.', 'Martin Cuilla', 'Phillip K', 'John J Lavorato']
The dataset has many rows and a column named 'content'. My question is: how can I use this code to extract the names from this column for every row, and store those names in another column in the corresponding row?
import os
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

# stanford_classifier / stanford_ner_path: local paths to the model and jar
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')

text = df['content']
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print(classified_text)
Try apply:
df['ne'] = df['content'].apply(get_continuous_chunks)
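The pattern can be sketched end to end with a toy DataFrame. To keep the sketch runnable without the NLTK model downloads, `extract_names` below is a hypothetical stand-in for `get_continuous_chunks` (it just collects capitalized words); the `.apply` wiring is the same either way:

```python
import pandas as pd

# Hypothetical stand-in for get_continuous_chunks: treats capitalized
# words as "entities" so the sketch runs without NLTK data downloads.
def extract_names(text):
    return [w for w in text.split() if w[:1].isupper()]

df = pd.DataFrame({'content': [
    'Donald Trump met House Republicans',
    'a quiet day in Washington',
]})

# apply calls the extractor once per row; the returned list lands in a
# new column aligned with the corresponding row.
df['ne'] = df['content'].apply(extract_names)
print(df['ne'].tolist())
```

With the real function in place, `extract_names` is simply replaced by `get_continuous_chunks`.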
For the code in the second example, create a function and apply it the same way:
def my_st(text):
    tokenized_text = word_tokenize(text)
    return st.tag(tokenized_text)

df['st'] = df['content'].apply(my_st)
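Note that `st.tag` returns flat `(token, tag)` pairs rather than grouped entity strings. If you want output shaped like the first example, one option is a small post-processing step that joins consecutive tokens sharing a non-'O' tag. A sketch (the sample `tagged` list is illustrative; tag names like 'PERSON' and 'LOCATION' follow Stanford's default 3-class model):

```python
def collapse_entities(tagged):
    """Join consecutive tokens that share the same non-'O' tag."""
    entities, current, current_tag = [], [], None
    for token, tag in tagged:
        if tag != 'O' and tag == current_tag:
            current.append(token)          # extend the running entity
        else:
            if current:
                entities.append(' '.join(current))
            # start a new entity, or reset on an 'O' token
            current = [token] if tag != 'O' else []
            current_tag = tag if tag != 'O' else None
    if current:                            # flush a trailing entity
        entities.append(' '.join(current))
    return entities

# Illustrative input in the shape st.tag would produce:
tagged = [('Donald', 'PERSON'), ('Trump', 'PERSON'),
          ('visited', 'O'), ('Washington', 'LOCATION')]
print(collapse_entities(tagged))  # ['Donald Trump', 'Washington']
```

This can be chained after `my_st` inside the same `.apply` call if you want entity strings per row.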