Using spaCy to remove names from a data frame in Python 3.9

Question

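I am working with the spacy package v3.2.1 in Python 3.9 and want to understand how to use it to remove names from a data frame. I followed the spaCy documentation and can identify names correctly, but I do not understand how to remove them. My goal is to remove all names from a specific column of the data frame.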

Actual

ID	Comment
A123	I am five years old, and my name is John
X907	Today I met with Dr. Jacob

What I want to accomplish

ID	Comment
A123	I am five years old, and my name is
X907	Today I met with Dr.

Code:

#loading packages
import spacy
import pandas as pd
from spacy import displacy


#loading CSV
df = pd.read_csv('names.csv')

#loading spacy large model
nlp = spacy.load("en_core_web_lg")

#checking/testing whether the spacy large model identifies named entities
df['test_col'] = df['Comment'].apply(lambda x: list(nlp(x).ents))

What my code does

ID	Comment	test_col
A123	I am five years old, and my name is John	[(John)]
X907	Today I met with Dr. Jacob	[(Jacob)]

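But how do I remove these names from the Comment column? I assume I need some kind of function that iterates over each row of the data frame and removes the identified entities. Any help is much appreciated.

Thanks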

Answer 1

Here is an idea using the string replace method:

Edit: stripped the parentheses to see if that helps.

# str(doc.ents) looks like "(John,)", so strip the parentheses and the trailing comma
df['test_col'] = df['Comment'].apply(lambda x: str(x).replace(str(nlp(x).ents).lstrip('(').rstrip(',)'), ''))

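I cast the variables to str to help with matching, since I was not sure whether they were already strings. You may need to index into the entities and loop over them if more than one name is found in a single comment, but that is the gist of it.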
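The loop over multiple names mentioned above might look like the following sketch. `remove_entity_texts` is a hypothetical helper name, not part of spaCy; it assumes the entity strings have already been collected, for example with `[ent.text for ent in nlp(x).ents]`. Plain `str.replace` removes every occurrence of each string, including matches inside longer words, so treat this as an approximation rather than a precise solution.

```python
def remove_entity_texts(comment, entity_texts):
    """Remove every string in entity_texts from comment using str.replace."""
    result = str(comment)
    # Remove longer names first, so "John Smith" goes before "John"
    for name in sorted(set(entity_texts), key=len, reverse=True):
        result = result.replace(name, "")
    return result


# Assumed usage with the nlp and df objects from the question:
# df['test_col'] = df['Comment'].apply(
#     lambda x: remove_entity_texts(x, [ent.text for ent in nlp(x).ents]))
```

Because this works on entity text rather than character offsets, it does not depend on the order in which the entities were found.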

Answer 2

You can use

import spacy
import pandas as pd

# Test dataframe
df = pd.DataFrame({'ID':['A123','X907'], 'Comment':['I am five years old, and my name is John', 'Today I met with Dr. Jacob']})

# Initialize the model
nlp = spacy.load('en_core_web_trf')

def remove_names(text):
    doc = nlp(text)
    newString = text
    for e in reversed(doc.ents):
        if e.label_ == "PERSON": # Only if the entity is a PERSON
            newString = newString[:e.start_char] + newString[e.start_char + len(e.text):]
    return newString

df['Comment'] = df['Comment'].apply(remove_names)
print(df.to_string())

Output:

     ID                               Comment
0  A123  I am five years old, and my name is
1  X907                 Today I met with Dr.
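For larger data frames, calling `nlp` once per row can be slow. A possible variant, sketched below, feeds the whole column through `nlp.pipe`, which processes texts in batches, and then cuts out the `PERSON` spans by character offset as in the function above. The names `cut_spans` and `remove_names_batch` are hypothetical helpers introduced for illustration, and the sketch assumes every value in the column is a string (no missing values).

```python
def cut_spans(text, spans):
    """Remove the (start, end) character ranges in spans from text."""
    # Work from the end of the string so earlier offsets stay valid
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text


def remove_names_batch(texts, nlp, label="PERSON"):
    """Run all texts through nlp.pipe and strip entities with the given label."""
    texts = list(texts)
    cleaned = []
    # nlp.pipe yields one Doc per input text, in the same order
    for text, doc in zip(texts, nlp.pipe(texts)):
        spans = [(e.start_char, e.end_char) for e in doc.ents if e.label_ == label]
        cleaned.append(cut_spans(text, spans))
    return cleaned


# Assumed usage with the nlp and df objects defined above:
# df['Comment'] = remove_names_batch(df['Comment'], nlp)
```

Passing `nlp` in as a parameter keeps the helper independent of which model is loaded, so the same function works with `en_core_web_lg` or `en_core_web_trf`.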
