NLTK Named Entity Recognition for a column in a dataset

Thanks to "alvas" for the code from "Named Entity Recognition with Regular Expression: NLTK". For example:

from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

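def get_continuous_chunks(text):
    # Tokenize, POS-tag, then chunk the text into a tree with named-entity subtrees
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            # Named-entity subtree: collect its tokens
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # A non-entity token ends the current run of adjacent entities
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the entity was a duplicate
            current_chunk = []

    # Flush an entity that sits at the very end of the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk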

txt = 'The new GOP era in Washington got off to a messy start Tuesday as House Republicans,under pressure from President-elect Donald Trump.'
print (get_continuous_chunks(txt))

The output is:

['GOP', 'Washington', 'House Republicans', 'Donald Trump']

I replaced this text with txt = df['content'][38] from my dataset and got this result:

['Ina', 'Tori K.', 'Martin Cuilla', 'Phillip K', 'John J Lavorato']

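The dataset has many rows and a column named 'content'. My question is: how can I use this code to extract the names from this column for every row, and store them in another column in the corresponding row?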

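import os
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

# stanford_classifier and stanford_ner_path are not defined in this snippet;
# they are the paths to the model file and the Stanford NER jar
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
text = df['content']  # the whole column (a pandas Series), not a single string
tokenized_text = word_tokenize(text)  # word_tokenize expects one string, so this fails
classified_text = st.tag(tokenized_text)
print(classified_text)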

Try apply:

df['ne'] = df['content'].apply(get_continuous_chunks)
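
Here is a minimal sketch of what apply does with a function that returns a list, in case it helps. To keep it runnable without the NLTK models, it uses a hypothetical stand-in called extract_capitalized (a simple regex, not part of NLTK) in place of get_continuous_chunks, and the tiny DataFrame is made up. It also guards against rows where 'content' is missing, since the tokenizer expects a string.

```python
import re
import pandas as pd

def extract_capitalized(text):
    """Hypothetical stand-in for get_continuous_chunks:
    returns runs of adjacent capitalized words."""
    # Missing values (None/NaN) are not strings; return an empty list for them
    if not isinstance(text, str):
        return []
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)

# Made-up data for illustration only
df = pd.DataFrame({
    "content": ["met Donald Trump in Washington", None, "no names here"]
})

# apply calls the function once per row; each cell of 'ne' holds a list
df["ne"] = df["content"].apply(extract_capitalized)
print(df)
```

The same isinstance guard could probably be added at the top of get_continuous_chunks if your real column contains empty rows.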

For the code in the second example, create a function and apply it in the same way:

def my_st(text):
    tokenized_text = word_tokenize(text)
    return st.tag(tokenized_text)

df['st'] = df['content'].apply(my_st)
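
Note that st.tag returns a list of (token, tag) pairs rather than whole names, so the 'st' column will hold tagged tokens. If you want names like in the first example, one possible approach is to merge adjacent tokens that share the same tag. The sketch below assumes the usual Stanford output where non-entity tokens are tagged 'O'; group_entities is a hypothetical helper and the sample list is hand-written, not real tagger output.

```python
from itertools import groupby

def group_entities(tagged, outside="O"):
    """Merge adjacent (token, tag) pairs with the same tag into one entity.
    Hypothetical helper; assumes non-entity tokens carry the tag 'O'."""
    entities = []
    # groupby yields one group per run of consecutive pairs with an equal tag
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(token for token, _ in group), tag))
    return entities

# Hand-written sample in the (token, tag) format, for illustration only
sample = [
    ("Donald", "PERSON"), ("Trump", "PERSON"), ("visited", "O"),
    ("Washington", "LOCATION"), (".", "O"),
]
print(group_entities(sample))
```

With that helper, something like df['st'].apply(group_entities) should give one list of (name, tag) pairs per row. Two different entities of the same type that sit directly next to each other would be merged into one, which is a limitation of this simple grouping.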