Error running my spacy summarization function on a text column in pandas dataframe
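Below is a spaCy function for summarization. I'm trying to run it over a pandas DataFrame column, but I get an empty column every time. I'm hoping someone can help me figure out why.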

    from heapq import nlargest
    from string import punctuation

    import spacy
    from spacy.lang.en.stop_words import STOP_WORDS

    def summarize(text, per):
        nlp = spacy.load('en_core_web_sm')
        doc = nlp(text)
        tokens = [token.text for token in doc]
        # Count how often each non-stop-word, non-punctuation token occurs.
        word_frequencies = {}
        for word in doc:
            if word.text.lower() not in list(STOP_WORDS):
                if word.text.lower() not in punctuation:
                    if word.text not in word_frequencies.keys():
                        word_frequencies[word.text] = 1
                    else:
                        word_frequencies[word.text] += 1
        # Normalise the counts by the highest frequency.
        max_frequency = max(word_frequencies.values())
        for word in word_frequencies.keys():
            word_frequencies[word] = word_frequencies[word] / max_frequency
        # Score each sentence by summing the weights of its words.
        sentence_tokens = [sent for sent in doc.sents]
        sentence_scores = {}
        for sent in sentence_tokens:
            for word in sent:
                if word.text.lower() in word_frequencies.keys():
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word.text.lower()]
                    else:
                        sentence_scores[sent] += word_frequencies[word.text.lower()]
        # Keep the top `per` share of sentences.
        select_length = int(len(sentence_tokens) * per)
        summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
        final_summary = [word.text for word in summary]
        summary = ''.join(final_summary)
        return summary

It also returns an empty result for a sample text:

    text = ('gov charlie crist launched amounts nuclear attack republican politics fox news sunday showdown marco rubio crist labeled rubio house speaker tax raiser forth record tax issues crist singled rubios failed 2007 plan eliminated property taxes floridians exchange increase state sales tax tax swap massive tax increase crist said march 28 2010 senate debate respect speaker youve got tell truth people thats rubio contends tax swap huge net tax cut plan supported gov jeb bush tax cut tax hike lets look months speaker early 2007 rubio proposed fundamental change floridas tax structure proposal scratch property taxes primary residences place state sales tax increased 25 cents dollar subject voter approval house analysis originally said swap save taxpayers total 58 billion year certainly contrary crists claim saved money spent money end year likely depended individual circumstances 2007 st petersburg times ran calculations rubios proposal homeowners renters homeowners family annual income 64280 home value 241100 current property tax tampa 506106 sales taxes paid 951 proposed property tax tampa 0 sales taxes paid 1290 rubios plan homeowners paid 4722 state taxes times contrast renters renters family annual income 46914 current rent 851 sales taxes paid 691 proposed rent 851 sales taxes paid 937 rubios plan renters pay additional 246 year taxes rental property owners pay property taxes meaning rent wouldnt affected talked swap swap owned home wouldnt pay tax anymore crist said debate percent fellow floridians renters applied enjoyed tax increase rubio responded renters opportunity buy exorbitant taxes pay property florida gone conversely rubio pointed increased sales tax bring revenue state nonresident visitors tourists contribute said floridians contribute rubios proposal got seal approval grover norquist president americans tax reform rubio supporter 2007 wrote legislators saying rubios tax swap proposal amounted net tax cut speaker rubios proposal net tax cut vote '
            'proposal constitute violation taxpayer protection pledge norquist wrote taxpayers florida reap benefits lower tax burden significant spending restraint state local level later house study said sales tax increase generate 93 billion exchange eliminating 158 billion property taxes heres house analysis swap combined tax initiatives tallahassee bunch politicians declare 7 billion net tax savings tax increase rep adam hasner rdelray beach told palm beach post vote proposal saying tax increase swap ultimately killed state senate crist spokeswoman andrea saul noted rubio said tax swap tax increase march 28 2010 debate according transcripts rubio said let tell supposed program raise taxes keeps talking probably largest tax increase floridas history eliminated property taxes sorts people supported jeb bush rubio spokesman alberto martinez said rubio mispoke shocking try distort martinez said based statements surround rubios largest tax increase line reasonable meant decrease crist said rubios tax swap proposal massive tax increase basic level rubios proposal tax increase tax decrease state sales tax property taxes micro level people pay pay macro level different studies said floridians paid 58 billion 65 billion generally leery tax impact projections suggestion rubios plan resulted tax increase statewide certainly massive crist suggests')
    summarize(text, 0.05)

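I don't know whether the function is wrong or something else is going on, but when I then ran it over the DataFrame column, I got an empty column again:

    df['spacy_summary'] = df['final'].apply(lambda x: summarize(x, 0.05))

So I guess the problem is in the function? Any help is appreciated. Thanks!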
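Your summarization logic assumes the input contains valid sentences that spaCy can recognize, but your sample text does not provide any. spaCy will probably treat it all as one long sentence; I don't think the text you feed in gets split into multiple sentences. Sentence segmentation needs well-formed input text, with punctuation and so on. Try a text made up of several sentences that spaCy can recognize.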
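This combines with the fact that you use int(len(sentence_tokens)*per). The int conversion truncates, which for a positive value means rounding down to the next smaller integer. So int(1*0.05) = int(0.05) = 0, which means the function returns 0 sentences. With per = 0.05 this happens for every text that has fewer than 20 segmented sentences. So change the ratio, or use something like max(1, int(len(sentence_tokens)*per)).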
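To make the arithmetic concrete, here is a small check in plain Python. The helper names naive_select_length and safe_select_length are made up for this illustration; they are not part of spaCy or of the posted function.

```python
def naive_select_length(n_sentences, per):
    # Same expression as in the question: int() truncates toward zero.
    return int(n_sentences * per)


def safe_select_length(n_sentences, per):
    # Guarded variant: always keep at least one sentence.
    return max(1, int(n_sentences * per))


# Compare both variants for a few sentence counts at per = 0.05.
for n in (1, 19, 100):
    print(n, naive_select_length(n, 0.05), safe_select_length(n, 0.05))
```

The guard only changes the result for short texts; once the product reaches 1 or more, both variants agree.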
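Beyond that, I think the code should generally work, although I haven't looked at every detail. I'm not sure, though, whether you know exactly what it does: it summarizes by keeping only the per share of the most representative complete sentences. It does not change anything at the word level.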
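As a rough sketch of that sentence-selection idea, the scoring step can be written without spaCy at all. Everything here is an assumption for illustration: summarize_sentences is a hypothetical helper, its input is a list of sentences already split into word lists (with spaCy you would build that from doc.sents), and the tiny STOP_WORDS set only stands in for spaCy's real list. Unlike the posted function, it lowercases the frequency keys so the later lookups match, keeps at least one sentence, and joins the selected sentences with spaces in their original order.

```python
from collections import Counter
from heapq import nlargest
from string import punctuation

# Tiny stand-in for spaCy's stop word list (assumption, for illustration only).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it"}


def summarize_sentences(sentences, per):
    """Keep roughly the `per` share of the highest-scoring sentences."""
    if not sentences:
        return ""

    # Count lowercased words, skipping stop words and punctuation tokens.
    freqs = Counter(
        w.lower()
        for sent in sentences
        for w in sent
        if w.lower() not in STOP_WORDS and w not in punctuation
    )
    if not freqs:
        return ""

    # Normalise by the most frequent word so weights fall in (0, 1].
    top = max(freqs.values())
    weights = {w: c / top for w, c in freqs.items()}

    # Score each sentence (by index) as the sum of its word weights.
    scores = {
        i: sum(weights.get(w.lower(), 0.0) for w in sent)
        for i, sent in enumerate(sentences)
    }

    # Guard against int() truncating to zero for short texts.
    k = max(1, int(len(sentences) * per))
    best = sorted(nlargest(k, scores, key=scores.get))  # back to document order
    return " ".join(" ".join(sentences[i]) for i in best)


sentences = [
    ["Taxes", "went", "up", "."],
    ["The", "sales", "tax", "replaced", "property", "taxes", "."],
    ["It", "is", "late", "."],
]
print(summarize_sentences(sentences, 0.05))
```

Note that whole sentences are selected and returned unchanged, so a text that is segmented into a single sentence comes back in full.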