How to extract all possible noun phrases from text
I want to automatically extract the concepts (noun phrases) I need from text. My plan is to extract all noun phrases, label them as one of two classes (desirable vs. undesirable phrases), and then train a classifier to separate them. Right now I am trying to extract all possible phrases as the training set. For example, given the sentence Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described., I want to get all the phrases: shoulder, richer mix, shoulder of richer mix, junctions, junctions of columns and beams, columns and beams, columns, beams, or any other possible candidate. The ideal phrases are shoulder, junctions, and junctions of columns and beams. But I don't care about correctness at this step; I just want to obtain the training set first. Is there any tool available for this kind of task?
I tried Rake from rake_nltk, but its output failed to include the phrases I wanted (i.e., it did not extract all possible phrases):
from rake_nltk import Rake
data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)
Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams']
(junctions of columns and beams is missing here. RAKE splits candidate phrases at stopwords such as of and and, so a candidate can never span them.)
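As an aside, RAKE's stopword list is configurable, and removing of and and from it would allow candidates to span those words. A minimal sketch, assuming rake_nltk's Rake constructor accepts a stopwords argument:

from rake_nltk import Rake
from nltk.corpus import stopwords

# assumption: Rake takes a custom stopword collection;
# dropping 'of' and 'and' lets candidates span those words
custom_stopwords = set(stopwords.words('english')) - {'of', 'and'}
r = Rake(stopwords=custom_stopwords)
r.extract_keywords_from_text(data)
print(r.get_ranked_phrases())

Note that this changes RAKE's behavior globally, so it will also produce extra junk candidates containing of and and.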
I also tried phrasemachine, but its results likewise missed some phrases I wanted:
import spacy
import phrasemachine

nlp = spacy.load('en_core_web_sm')
data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
doc = nlp(data)
# spaCy is used here only for tokenization and POS tagging;
# phrasemachine does the actual phrase matching
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
while len(out['token_spans']):
    start, end = out['token_spans'].pop()
    print(tokens[start:end])
Result:
[(2, 6), (4, 6), (14, 17)]
['junctions', 'of', 'columns']
['richer', 'mix']
['shoulder', 'of', 'richer', 'mix']
(Many noun phrases are missing here.)
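For what it's worth, phrasemachine matches a fixed part-of-speech pattern over the tagged tokens, which is why a coordinated phrase like junctions of columns and beams falls outside it; and if I recall correctly, get_phrases also has a minimum-length parameter defaulting to 2, which would explain the missing single-word nouns. A hedged one-line sketch (the minlen keyword is an assumption about phrasemachine's API):

# minlen=1 is assumed to exist; it would admit single-word candidates
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, minlen=1, output="token_spans")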
You may want to use the noun_chunks attribute:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')
phrases = set()
for nc in doc.noun_chunks:
    # the chunk itself, e.g. 'a shoulder'
    phrases.add(nc.text)
    # the full subtree around the chunk's head, which also pulls in
    # attached prepositional phrases, e.g. 'a shoulder of richer mix'
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i+1].text)
print(phrases)
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
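The asker's ideal list has shoulder rather than a shoulder, so a natural refinement is to strip a chunk-initial determiner before collecting the chunk. A minimal sketch building on the code above (the without_determiner helper is not part of spaCy, just an illustration):

def without_determiner(span):
    # drop a chunk-initial article/determiner such as 'a' or 'these'
    if len(span) > 1 and span[0].dep_ == 'det':
        return span[1:].text
    return span.text

phrases = set()
for nc in doc.noun_chunks:
    phrases.add(without_determiner(nc))
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i + 1].text)
print(phrases)

This yields shoulder and junctions directly; the subtree spans are left untouched, so a shoulder of richer mix still keeps its article.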