When adding SpaCy output to existing dataframe, columns do not align

Question

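I have a csv containing a column of article titles, and I am using SpaCy to extract any people's names that appear in the titles. When I try to add a new column to the csv with the names SpaCy extracted, they do not align with the rows they were extracted from.

I believe this is because the SpaCy results have their own index, independent of the index of the original data.

I tried adding , index=df.index) to the new column line, but I get "ValueError: Length of passed values is 2, index implies 10."

How can I align the SpaCy output with the rows it came from?

Here is my code: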

import pandas as pd
from pandas import DataFrame
df = (pd.read_csv(r"C:\Users\Admin\Downloads\itsnicethat (5).csv", nrows=10,
                  usecols=['article_title']))
article = [_ for _ in df['article_title']]

import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(str(article))
ents = list(doc.ents)
people = []
for ent in ents:
    if ent.label_ == "PERSON":
        people.append(ent)

import numpy as np
df['artist_names'] = pd.Series(people)
print(df.head())

This is the resulting dataframe:

                                       article_title       artist_names
0  “They’re like, is that? Oh it’s!” – ...               (Hannah, Ward)
1  Billed as London’s biggest public festival of ...  (Dylan, Mulvaney)
2  Transport yourself back to the dusky skies and...                NaN
3  Turning to art at the beginning of quarantine ...                NaN
4  Dylan Mulvaney, head of design at Gretel, expl...                NaN

This is what I expected:

                                       article_title       artist_names
0  “They’re like, is that? Oh it’s!” – ...               (Hannah, Ward)
1  Billed as London’s biggest public festival of ...                NaN
2  Transport yourself back to the dusky skies and...                NaN
3  Turning to art at the beginning of quarantine ...                NaN
4  Dylan Mulvaney, head of design at Gretel, expl...   (Dylan, Mulvaney)

You can see that the 5th value in the artist_names column relates to the 5th article title. How can I get them to align?

Thanks for your help.

Answer 1

import numpy as np

people = []
for ent in ents:
    if ent.label_ == "PERSON":
        people.append(ent)
    else:
        people.append(np.nan)  # placeholder when the entity is not a PERSON

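Include an else branch so that any entity whose label_ is not PERSON is recorded as NaN. Note that this produces one entry per detected entity, not one per row, so it lines up with the dataframe rows only if every title yields exactly one entity.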
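To see why the number of entries matters, here is a small pandas-only sketch of what happens in the question. The titles and names are made-up stand-ins, not the real data, and no SpaCy call is involved. A bare pd.Series gets its own 0..n-1 index, so its values land on the first rows, and forcing df.index onto a shorter list raises the ValueError.

```python
import pandas as pd

# Made-up stand-ins: five titles, but only two extracted names
df = pd.DataFrame({"article_title": ["a", "b", "c", "d", "e"]})
people = ["Hannah Ward", "Dylan Mulvaney"]

# A bare Series is indexed 0..1, so pandas places the values on rows 0 and 1
# and fills the remaining rows with NaN, regardless of which title they came from
df["artist_names"] = pd.Series(people)
misaligned = df["artist_names"].tolist()

# Forcing the dataframe's index onto a shorter list fails instead of aligning
try:
    pd.Series(people, index=df.index)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(df)
print(error_message)
```

So the new column needs exactly one value per row, in row order, before it is assigned.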

Answer 2

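I would loop over the articles, detect the entities in each article separately, and put the detected entities in a list with one element per article: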

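import pandas as pd
import spacy

nlp = spacy.load('en_core_web_lg')
article = df['article_title'].tolist()

# Build one list of PERSON entities per article, in row order
entities_by_article = []
for doc in nlp.pipe(article):
    people = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            people.append(ent)
    entities_by_article.append(people)

# Reuse df's index so each list lands on the row it came from
df['artist_names'] = pd.Series(entities_by_article, index=df.index)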
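If you would rather store plain strings, with NaN for titles that contain no person (closer to the expected output in the question), a variant could look like the sketch below. The helper names people_per_row and add_artist_names are hypothetical, and the code only assumes that each doc exposes .ents whose items have .label_ and .text, as SpaCy docs and spans do.

```python
import numpy as np
import pandas as pd


def people_per_row(docs):
    """Return one cell per doc: comma-joined PERSON names, or NaN if none."""
    cells = []
    for doc in docs:
        names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
        cells.append(", ".join(names) if names else np.nan)
    return cells


def add_artist_names(df, docs, column="artist_names"):
    """Attach one cell per row, using df's own index so rows stay aligned."""
    cells = people_per_row(docs)
    df[column] = pd.Series(cells, index=df.index, dtype=object)
    return df
```

With the objects from this answer, the call would be something like add_artist_names(df, nlp.pipe(df['article_title'].tolist())).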

Note: for doc in nlp.pipe(article) is spaCy's more efficient way of looping over a list of texts, and it can be replaced by:

for a in article:
    doc = nlp(a)
    # rest of the code within the loop
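The plain-loop form also maps onto Series.apply, which calls a function once per title and returns the results on the original index, so no manual alignment is needed. This is a sketch rather than part of the answer: extract_people and add_people_with_apply are hypothetical names, and nlp is assumed to be any callable that returns an object with .ents (for example the loaded SpaCy pipeline). It gives up the batching that nlp.pipe provides.

```python
import pandas as pd


def extract_people(text, nlp):
    """Run the pipeline on one text and return the PERSON names as strings."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]


def add_people_with_apply(df, nlp, source="article_title", target="artist_names"):
    """Apply extract_people row by row; the result keeps df's index."""
    df[target] = df[source].apply(extract_people, nlp=nlp)
    return df
```

Usage would be add_people_with_apply(df, nlp), after which each row holds the list of names found in its own title.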

Tags: python, pandas, spacy