When adding spaCy output to an existing DataFrame, columns do not align
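I have a CSV with a column of article titles, and I am using spaCy to extract any person names that appear in the titles. When I try to add a new column to the CSV with the names spaCy extracted, they do not line up with the rows they were extracted from.
I believe this is because the spaCy results have their own index, independent of the index of the original data.
I tried adding , index=df.index) to the line that creates the new column, but I get "ValueError: Length of passed values is 2, index implies 10."
How can I align the spaCy output with the rows it came from?
Here is my code: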
import pandas as pd
from pandas import DataFrame
df = (pd.read_csv(r"C:\Users\Admin\Downloads\itsnicethat (5).csv", nrows=10,
                  usecols=['article_title']))
article = [_ for _ in df['article_title']]
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp(str(article))
ents = list(doc.ents)
people = []
for ent in ents:
    if ent.label_ == "PERSON":
        people.append(ent)
import numpy as np
df['artist_names'] = pd.Series(people)
print(df.head())
This is the resulting DataFrame:
   article_title                                       artist_names
0  “They’re like, is that? Oh it’s!” – ...             (Hannah, Ward)
1  Billed as London’s biggest public festival of ...   (Dylan, Mulvaney)
2  Transport yourself back to the dusky skies and...   NaN
3  Turning to art at the beginning of quarantine ...   NaN
4  Dylan Mulvaney, head of design at Gretel, expl...   NaN
This is what I expected:
   article_title                                       artist_names
0  “They’re like, is that? Oh it’s!” – ...             (Hannah, Ward)
1  Billed as London’s biggest public festival of ...   NaN
2  Transport yourself back to the dusky skies and...   NaN
3  Turning to art at the beginning of quarantine ...   NaN
4  Dylan Mulvaney, head of design at Gretel, expl...   (Dylan, Mulvaney)
You can see that the 5th value in the artist_names column relates to the 5th article title. How can I get them to align?
Thanks for your help.
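The misalignment can be reproduced without spaCy. The sketch below uses made-up titles and names (illustrative values, not taken from the original CSV). It shows that pandas matches a Series to the DataFrame by index label, so a shorter Series with its own default index fills the first rows, and that forcing `index=df.index` onto too few values raises a `ValueError`.

```python
import pandas as pd

# Illustrative data only: five titles, two of which mention a person.
df = pd.DataFrame({"article_title": [
    "Hannah Ward on colour",
    "A festival of design",
    "Dusky skies",
    "Art in quarantine",
    "Dylan Mulvaney on branding",
]})

# Names collected into one flat list lose track of which row they came from.
people = ["Hannah Ward", "Dylan Mulvaney"]

# pd.Series(people) gets its own index (0, 1). Assignment aligns on those
# labels, so the names land in rows 0 and 1 and the other rows become NaN.
df["artist_names"] = pd.Series(people)
print(df)

# Forcing the DataFrame's index onto a list of the wrong length fails.
try:
    pd.Series(people, index=df.index)
    length_error = None
except ValueError as exc:
    length_error = exc
print("ValueError:", length_error)
```

The exact wording of the error message differs between pandas versions, so the sketch only checks the exception type.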
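if ent.label_ == "PERSON":
    people.append(ent)
else:
    people.append(np.nan) # if ent.label_ is not a PERSON
Include an else statement so that if label_ is not PERSON, it is recorded as NaN. Note that this adds one value per entity rather than one per title, so the column lines up with the rows only when every title yields exactly one entity.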
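I would loop over the articles, detect the entities in each article separately, and put the detected entities in a list with one element per article:
nlp = spacy.load('en_core_web_lg')
article = [_ for _ in df['article_title']]
entities_by_article = []
for doc in nlp.pipe(article):  # one Doc per title, in the same order
    people = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            people.append(ent)
    # append once per title, even when no person was found
    entities_by_article.append(people)
df['artist_names'] = pd.Series(entities_by_article)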
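A variation on the loop above, as a minimal sketch rather than a definitive implementation: it stores plain strings instead of spaCy spans, writes NaN where no person was found (as in the expected output in the question), and reuses the titles' own index so the result still lines up when the DataFrame's index is not 0..n-1. The names `names_per_row` and `fake_find_people` are hypothetical, and `fake_find_people` is only a stand-in for the spaCy call so that the example runs without a model installed.

```python
import numpy as np
import pandas as pd


def names_per_row(titles, find_people):
    """Return one value per title: the joined names, or NaN if none were found.

    `find_people` is any callable mapping a title string to a list of names.
    """
    values = []
    for title in titles:
        names = find_people(title)
        values.append(", ".join(names) if names else np.nan)
    # Reuse the titles' own index so each value stays on its source row.
    return pd.Series(values, index=titles.index, dtype=object)


def fake_find_people(title):
    """Stand-in for spaCy (assumption): looks for two hard-coded names."""
    known = ["Hannah Ward", "Dylan Mulvaney"]
    return [name for name in known if name in title]


# With a spaCy pipeline loaded as `nlp`, the callable could instead be:
#   lambda t: [ent.text for ent in nlp(t).ents if ent.label_ == "PERSON"]

df = pd.DataFrame(
    {"article_title": [
        "Hannah Ward on colour",
        "A festival of design",
        "Dylan Mulvaney on branding",
    ]},
    index=[10, 20, 30],  # deliberately not 0..n-1
)
df["artist_names"] = names_per_row(df["article_title"], fake_find_people)
print(df)
```

Passing the extractor in as a callable keeps the row-by-row bookkeeping separate from the entity detection, so the same function works with the per-title `nlp(a)` call or any other extractor.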
Note: for doc in nlp.pipe(article) is spaCy's more efficient way of looping over a list of texts, and could be replaced with:
for a in article:
    doc = nlp(a)
    ## rest of code within loop