Spacy Dependency Parsing with Pandas dataframe
I want to use Spacy's dependency parser to extract noun-adjective pairs from my pandas dataframe for aspect-based sentiment analysis. I am trying this code on the Amazon Fine Food Reviews dataset from Kaggle:
However, there seems to be something wrong with the way I am feeding the pandas dataframe to spacy, and my results are not what I expect. Could someone please help me debug this? Many thanks.
!python -m spacy download en_core_web_lg
import nltk
nltk.download('vader_lexicon')
import spacy
nlp = spacy.load("en_core_web_lg")
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
def find_sentiment(doc):
    # find roots of all entities in the text
    for i in df['Text'].tolist():
        doc = nlp(i)
        ner_heads = {ent.root.idx: ent for ent in doc.ents}
        rule3_pairs = []
        for token in doc:
            children = token.children
            A = "999999"
            M = "999999"
            add_neg_pfx = False
            for child in children:
                if(child.dep_ == "nsubj" and not child.is_stop): # nsubj is nominal subject
                    if child.idx in ner_heads:
                        A = ner_heads[child.idx].text
                    else:
                        A = child.text
                if(child.dep_ == "acomp" and not child.is_stop): # acomp is adjectival complement
                    M = child.text
                # example - 'this could have been better' -> (this, not better)
                if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
                    neg_prefix = "not"
                    add_neg_pfx = True
                if(child.dep_ == "neg"): # neg is negation
                    neg_prefix = child.text
                    add_neg_pfx = True
            if (add_neg_pfx and M != "999999"):
                M = neg_prefix + " " + M
            if(A != "999999" and M != "999999"):
                rule3_pairs.append((A, M, sid.polarity_scores(M)['compound']))
    return rule3_pairs
df['three_tuples'] = df['Text'].apply(find_sentiment)
df.head()
My results look like this, which clearly means something is wrong with my loop:
If you call apply on df['Text'], you are effectively iterating over every value in that column and passing each value to the function. Here, however, your function itself iterates over the very same dataframe column you are applying it to, and it also overwrites the value that was passed in at the start of the function.
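To make that concrete, here is a minimal sketch (the two-row frame and its contents are made up purely for illustration) showing that apply hands the function one cell value at a time rather than the whole column:

import pandas as pd

# hypothetical toy frame, only to illustrate apply() semantics
toy = pd.DataFrame({"Text": ["The soup was great", "The bread was stale"]})

def what_apply_receives(value):
    # `value` is a single string from the 'Text' column, not a Series
    return type(value).__name__

print(toy["Text"].apply(what_apply_receives))
# 0    str
# 1    str
# Name: Text, dtype: object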
So I would start by rewriting the function as follows and seeing whether it produces the expected result. I can't say for certain, since you haven't posted any sample data, but this should at least move the ball forward:
def find_sentiment(text):
    doc = nlp(text)
    ner_heads = {ent.root.idx: ent for ent in doc.ents}
    rule3_pairs = []
    for token in doc:
        children = token.children
        A = "999999"
        M = "999999"
        add_neg_pfx = False
        for child in children:
            if(child.dep_ == "nsubj" and not child.is_stop): # nsubj is nominal subject
                if child.idx in ner_heads:
                    A = ner_heads[child.idx].text
                else:
                    A = child.text
            if(child.dep_ == "acomp" and not child.is_stop): # acomp is adjectival complement
                M = child.text
            # example - 'this could have been better' -> (this, not better)
            if(child.dep_ == "aux" and child.tag_ == "MD"): # MD is modal auxiliary
                neg_prefix = "not"
                add_neg_pfx = True
            if(child.dep_ == "neg"): # neg is negation
                neg_prefix = child.text
                add_neg_pfx = True
        if (add_neg_pfx and M != "999999"):
            M = neg_prefix + " " + M
        if(A != "999999" and M != "999999"):
            rule3_pairs.append((A, M, sid.polarity_scores(M)['compound']))
    return rule3_pairs
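For reference, a minimal usage sketch of the rewritten function, assuming a dataframe with a 'Text' column like the one in the question (the two sample reviews here are invented):

import pandas as pd

sample = pd.DataFrame({"Text": ["The pasta was delicious.",
                                "The service was not great."]})

# each row now gets a list of (aspect, descriptor, VADER compound score) tuples
sample["three_tuples"] = sample["Text"].apply(find_sentiment)
print(sample[["Text", "three_tuples"]])

One more note: on a corpus the size of the Amazon reviews dataset, it is usually much faster to parse the column in batches with nlp.pipe(df['Text']) than to call nlp(text) once per row, since spaCy can then process the documents in bulk.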