How to get entities from a text and match them to the id of the source file?

Question

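I have a csv file with several columns, including an id column and a text column.

Example source file: source_file

I would like to use spaCy to extract the entity text and label from each text, then write them to a dataframe together with the corresponding source id. A sentence may well contain more than one entity; those entities should all share the same id.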

desired_output

I thought the pandas apply function would be the best way to do this, but I get an error message. Can anyone tell me what I am doing wrong?

import pandas as pd
import spacy

df = pd.read_csv(r'data/test_data.csv')
nlp = spacy.load("nl_core_news_lg")
ner_entities = []

def get_entities(row):
    # Collect one [id, entity text, entity label] record per entity in the row's text.
    entity_id = row['id']
    text = row['text']
    doc = nlp(text)
    for ent in doc.ents:
        ner_entities.append([entity_id, ent.text, ent.label_])

df.apply(lambda row: get_entities(row))
ner_df = pd.DataFrame(ner_entities, columns=['id', 'ent', 'label'])
merged_df = pd.merge(df, ner_df, on='id', how='outer')

I get the following error message:

error message

Answer 1

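From the comments:

You need to set axis=1 to apply the function to each row, so: df.apply(lambda row: get_entities(row), axis=1). Otherwise axis defaults to 0 and the function is called once per column, so each call receives a whole column instead of a row and a lookup such as row['id'] fails.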
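For reference, here is a minimal sketch of the question's approach with axis=1 applied. It is not the asker's exact code: the helper name extract_entities and its id_col/text_col parameters are made up for illustration, and each row returns its records instead of appending to a global list. It only assumes that nlp is a callable returning an object with .ents, whose items have .text and .label_, as a loaded spaCy pipeline does.

```python
import pandas as pd


def extract_entities(df, nlp, id_col="id", text_col="text"):
    """Return one row per entity found in df[text_col], tagged with df[id_col].

    `extract_entities`, `id_col` and `text_col` are illustrative names.
    `nlp` is assumed to be a callable (e.g. a loaded spaCy pipeline) whose
    result exposes `.ents`, with `.text` and `.label_` on each entity.
    """

    def entities_for_row(row):
        doc = nlp(row[text_col])
        # One (id, entity text, entity label) record per entity in this row.
        return [(row[id_col], ent.text, ent.label_) for ent in doc.ents]

    # axis=1 passes each row to the function; the default axis=0 passes columns.
    per_row = df.apply(entities_for_row, axis=1)

    # Flatten the per-row lists into a single list of records.
    records = [record for row_records in per_row for record in row_records]
    return pd.DataFrame(records, columns=[id_col, "ent", "label"])
```

With the question's setup this could be called as ner_df = extract_entities(df, nlp), followed by the same pd.merge(df, ner_df, on='id', how='outer') as before. A text with several entities then appears once per entity with the same id, and a text with no entities keeps its row with empty ent and label values.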


python

nlp

pandas

spacy-3