Extract two consecutive nouns using spaCy
Here is a simple dataset:
import pandas as pd
product = ['knife', 'box set', 'beautiful jewellery set on sale', 'green']
df = pd.DataFrame(product, columns = ['product_name'])
df
The output is as follows:
| | product_name |
|---|---|
| 0 | knife |
| 1 | box set |
| 2 | beautiful jewellery set on sale |
| 3 | green |
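I would like to categorize these products by extracting two consecutive nouns where they are available. So far I have the following, but in every case the category ends up as a single noun: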
!pip install -q --upgrade spacy
import spacy
nlp = spacy.load('en_core_web_sm')
category=[]
for i in df['product_name'].tolist():
    doc = nlp(i)
    for t in doc:
        if t.pos_ in ['NOUN']:
            category.append(f'{t}')
            break
    if t.pos_ not in ['NOUN']:
        category.append('NaN')
df1 = pd.DataFrame(category, columns =['product_category'])
df1
My output:
| | product_category |
|---|---|
| 0 | knife |
| 1 | set |
| 2 | jewellery |
| 3 | NaN |
Expected output:
| | product_category |
|---|---|
| 0 | knife |
| 1 | box set |
| 2 | jewellery set |
| 3 | NaN |
Is it possible to add a condition to the code so that it extracts two nouns that follow one another?
You can use spaCy's Matcher:
import spacy
import pandas as pd
import numpy as np
from spacy.matcher import Matcher

product = ['knife', 'box set', 'beautiful jewellery set on sale', 'green']
df = pd.DataFrame(product, columns = ['product_name'])

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# One NOUN token, optionally followed by a second NOUN token
pattern = [{'POS': 'NOUN'},{'POS': 'NOUN','OP':'?'}]
matcher.add('NOUN_PATTERN', [pattern])

def get_two_nouns(x):
    doc = nlp(x)
    results = []
    # Collect the text of every span the pattern matches
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        results.append(span.text)
    # Keep the longest match; fall back to NaN when nothing matched
    return max(results, key = lambda x: len(x.split()), default=np.nan)

df['product_name'].apply(get_two_nouns)
Output:
0            knife
1          box set
2    jewellery set
3              NaN
Name: product_name, dtype: object
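The pattern
pattern = [{'POS': 'NOUN'},{'POS': 'NOUN','OP':'?'}]
matches a token tagged NOUN, optionally followed by a second NOUN token. The second token is optional because its OP operator is set to ?.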
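
To make the effect of the optional second token concrete, here is a minimal pure-Python sketch that enumerates the same kind of candidate spans for tokens that are already POS-tagged. It is an illustration only, not spaCy's matching algorithm: the helper name candidate_spans and the hand-written tags are assumptions, and the real Matcher may return its matches in a different order.

```python
def candidate_spans(tagged):
    """List every span that "NOUN, then an optional NOUN" could cover.

    `tagged` is a list of (word, pos) pairs. The tags are hand-written
    assumptions standing in for what a spaCy pipeline would predict.
    """
    spans = []
    for i, (word, pos) in enumerate(tagged):
        if pos != 'NOUN':
            continue
        # A single NOUN is already a match, because the second token is optional.
        spans.append(word)
        # If the next token is also a NOUN, the two-token span matches as well.
        if i + 1 < len(tagged) and tagged[i + 1][1] == 'NOUN':
            spans.append(f'{word} {tagged[i + 1][0]}')
    return spans


tagged = [('beautiful', 'ADJ'), ('jewellery', 'NOUN'), ('set', 'NOUN'),
          ('on', 'ADP'), ('sale', 'NOUN')]
print(candidate_spans(tagged))
```

Single nouns such as "set" and "sale" appear among the candidates next to "jewellery set", so a selection step is still needed to keep the two-noun span.
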
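The line
return max(results, key = lambda x: len(x.split()), default=np.nan)
returns the longest match, with length counted in whitespace-separated tokens. When nothing matched, the default argument makes it return NaN.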
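
This selection step can be tried without spaCy. The sketch below applies the same max call to hand-written candidate lists; the lists and the helper name longest_candidate are assumptions for illustration, and float('nan') stands in for np.nan. Two details are worth noting: max keeps the first of several equally long items, and default is what produces NaN for an empty list.

```python
import math


def longest_candidate(results):
    # Same selection as in the answer, with float('nan') in place of np.nan.
    return max(results, key=lambda x: len(x.split()), default=float('nan'))


print(longest_candidate(['jewellery', 'jewellery set', 'set', 'sale']))  # jewellery set
print(longest_candidate(['knife']))  # knife
# With ties, max keeps the first item that reaches the maximum length.
print(longest_candidate(['box', 'set']))  # box
# An empty list falls back to the default value.
print(math.isnan(longest_candidate([])))  # True
```
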