Problem analyzing a doc column in a df with spaCy nlp
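After using an Amazon review scraper to build this data frame, I called nlp to tokenize the reviews and create a new column, 'doc', containing the processed reviews.
Now I am trying to create a pattern to analyze the reviews in the doc column, but I keep getting no matches, which makes me think I am missing one more preprocessing step, or perhaps not pointing the matcher in the right direction.
The following code executes without any errors, but I receive a matches list with 0 entries, even though I know the word exists in the doc column. The spaCy documentation is still a bit thin, and I am not sure my matcher.add call is correct, because the specific one from the tutorial,
matcher.add("Name_of_List", None, pattern)
returns an error saying that this class only takes 2 arguments.
Question: What do I need to change in order to accurately analyze the df doc column for the created pattern?
Thanks!
Full code: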
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_md')
df = pd.read_csv('paper_towel_US.csv')

# Call nlp to return a processed Doc for each review
df['doc'] = [nlp(body) for body in df.body]
# Count the number of tokens in each Doc
df['num_tokens'] = [len(doc) for doc in df.doc]

# Create the matcher and register the pattern
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "love"},
           {"OP": "+"}]
matcher.add("QUALITY_PATTERN", [pattern])

def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # Returns the first filtered span only, or None when nothing matches
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)

df['doc'].apply(find_matches)
Sample of the df, copied via df.iloc[596:600, :].to_clipboard(sep=','):
,product,title,rating,body,doc,num_tokens
596,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Awesome!,5,Great towels!,Great towels!,3
597,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Good buy!,5,Love these,Love these,2
598,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Meh,3,"Does not clean countertop messes well. Towels leave a large residue. They are durable, though","Does not clean countertop messes well. Towels leave a large residue. They are durable, though",18
599,Amazon.com: Customer reviews: Bamboo Towels - Heavy Duty Machine Washable Reusable Rayon Towels - One roll replaces 6 months of towels! 1 Pack,Exactly as Described. Packaged Well and Mailed Promptly,4,Exactly as Described. Packaged Well and Mailed Promptly,Exactly as Described. Packaged Well and Mailed Promptly,9
You are trying to get matches from the literal string "df.doc" with doc = nlp("df.doc"). You need to extract matches from the df['doc'] column instead.
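To make the difference concrete, here is a minimal sketch of why the literal string produces no matches. It assumes a blank English pipeline (spacy.blank("en"), tokenizer only) so that no trained model is needed, and it uses LOWER instead of LEMMA because a blank pipeline has no lemmatizer; both are substitutions for illustration, not what the question's code does.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough for this illustration.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
# LOWER stands in for LEMMA here, since a blank pipeline sets no lemmas.
matcher.add("LOVE", [[{"LOWER": "love"}]])

# The string "df.doc" is just text; it is not the DataFrame column.
literal_doc = nlp("df.doc")
literal_matches = matcher(literal_doc)

# A real review, processed the same way as each cell of df['doc'].
review_doc = nlp("Love these")
review_matches = matcher(review_doc)

print(len(literal_matches))  # 0
print(len(review_matches))   # 1
```

The matcher only ever sees the Doc it is given, so it has to be called once per Doc in the column rather than once on a string that names the column.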
An example solution is to remove doc = nlp("df.doc") and use nlp = spacy.load('en_core_web_sm'):
def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)

>>> df['doc'].apply(find_matches)
0                  None
1    (0, 2, Love these)
2                  None
3                  None
Name: doc, dtype: object
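Note that find_matches returns from inside the loop, so it yields only the first filtered span and gives None for rows without a match, as in the output above. If every match per review is wanted, one possible variant (a sketch, with the hypothetical name find_all_matches, again assuming a blank pipeline and LOWER in place of LEMMA) returns a list instead:

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: blank pipeline and LOWER are used only to keep this self-contained.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "love" (case-insensitive) followed by one or more tokens of any kind.
matcher.add("QUALITY_PATTERN", [[{"LOWER": "love"}, {"OP": "+"}]])

def find_all_matches(doc):
    """Return every non-overlapping match as (start, end, text); [] if none."""
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    return [(s.start, s.end, s.text) for s in filter_spans(spans)]

loved = find_all_matches(nlp("Love these"))
plain = find_all_matches(nlp("Great towels!"))
print(loved)  # [(0, 2, 'Love these')]
print(plain)  # []
```

An empty list is often easier to work with downstream than None, for example when counting matches per row with len.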
Full code snippet:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
df = pd.read_csv(r'C:\Users\admin\Desktop\s.txt')

# Call nlp to return a processed Doc for each review
df['doc'] = [nlp(body) for body in df.body]
# Count the number of tokens in each Doc
df['num_tokens'] = [len(doc) for doc in df.doc]

# Create the matcher and register the pattern
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "love"},
           {"OP": "+"}]
matcher.add("QUALITY_PATTERN", [pattern])

# Removed: these lines parsed the literal string "df.doc", not the column
#doc = nlp("df.doc")
#matches = matcher(doc)

def find_matches(doc):
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # Returns the first filtered span only, or None when nothing matches
    for span in spacy.util.filter_spans(spans):
        return (span.start, span.end, span.text)

print(df['doc'].apply(find_matches))
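On the matcher.add part of the question: as far as I know, in spaCy v3 the second positional argument of Matcher.add is the list of patterns, and the optional callback is passed as the on_match keyword, which would explain why the older tutorial form with None in the middle is rejected. A small sketch under that assumption (record_match and seen are hypothetical names; the blank pipeline and LOWER are used only so the example runs without a trained model):

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3, blank pipeline, LOWER instead of LEMMA.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "love"}, {"OP": "+"}]

seen = []

def record_match(matcher, doc, i, matches):
    """Hypothetical callback: store the text of the i-th match."""
    _, start, end = matches[i]
    seen.append(doc[start:end].text)

# Key first, then a list of patterns; the callback goes in by keyword.
matcher.add("QUALITY_PATTERN", [pattern], on_match=record_match)

doc = nlp("Love these")
matches = matcher(doc)
print(seen)  # ['Love these']
```

If no callback is needed, matcher.add("QUALITY_PATTERN", [pattern]) as used in the snippet above is sufficient.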