Merging POS tags by noun phrase chunk
My question is similar to this one. In spacy, I can do part-of-speech tagging and noun phrase identification separately, e.g.
import spacy
nlp = spacy.load('en')
sentence = ('For instance , consider one simple phenomena : '
            'a question is typically followed by an answer , '
            'or some explicit statement of an inability or refusal to answer .')
token = nlp(sentence)
token_tag = [(word.text, word.pos_) for word in token]
The output looks like:
[('For', 'ADP'),
('instance', 'NOUN'),
(',', 'PUNCT'),
('consider', 'VERB'),
('one', 'NUM'),
('simple', 'ADJ'),
('phenomena', 'NOUN'),
...]
For noun phrases or chunks, I can get the noun_chunks of the document as follows:
[nc for nc in token.noun_chunks] # [instance, one simple phenomena, an answer, ...]
I'm wondering if there is a way to cluster the POS tags according to noun_chunks so that I get output like:
[('For', 'ADP'),
('instance', 'NOUN'), # or NOUN_CHUNKS
(',', 'PUNCT'),
('one simple phenomena', 'NOUN_CHUNKS'),
...]
I figured out how to do it. Basically, we can get the start and end positions of the noun phrase tokens as follows:
noun_phrase_position = [(s.start, s.end) for s in token.noun_chunks]
noun_phrase_text = dict([(s.start, s.text) for s in token.noun_chunks])
token_pos = [(i, t.text, t.pos_) for i, t in enumerate(token)]
Then I use these start/end positions to merge the token_pos list:
index = 0  # must be initialized before the loop
result = []
for start, end in noun_phrase_position:
    result += token_pos[index:start]      # tokens before this noun chunk
    result.append(token_pos[start:end])   # the chunk itself, kept as a sub-list
    index = end
result += token_pos[index:]               # remaining tokens after the last chunk
result_merge = []
for r in result:
    if isinstance(r, list) and len(r) > 0:
        # r is a noun chunk: collapse it to one (index, text, 'NOUN_PHRASE') tuple
        result_merge.append((r[0][0], noun_phrase_text.get(r[0][0]), 'NOUN_PHRASE'))
    else:
        result_merge.append(r)
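The same logic can be packaged into a single standalone function. Here is a sketch that operates on plain lists, so it runs without loading a spaCy model (the helper name merge_noun_chunks and the toy data below are mine; in real use the three inputs come from the spaCy calls above):

```python
def merge_noun_chunks(token_pos, chunk_spans, chunk_texts):
    """Collapse noun-chunk token runs into single NOUN_PHRASE entries.

    token_pos   -- [(token_index, text, pos), ...] for every token
    chunk_spans -- [(start, end), ...] token offsets of each noun chunk
    chunk_texts -- {start_offset: chunk_text} for each noun chunk
    """
    merged = []
    index = 0
    for start, end in chunk_spans:
        merged += token_pos[index:start]              # tokens before the chunk
        merged.append((start, chunk_texts[start], 'NOUN_PHRASE'))
        index = end
    merged += token_pos[index:]                       # tokens after the last chunk
    return merged

# Toy data mirroring the example sentence above.
token_pos = [(0, 'For', 'ADP'), (1, 'instance', 'NOUN'), (2, ',', 'PUNCT'),
             (3, 'consider', 'VERB'), (4, 'one', 'NUM'),
             (5, 'simple', 'ADJ'), (6, 'phenomena', 'NOUN')]
chunk_spans = [(1, 2), (4, 7)]
chunk_texts = {1: 'instance', 4: 'one simple phenomena'}
merged = merge_noun_chunks(token_pos, chunk_spans, chunk_texts)
# merged == [(0, 'For', 'ADP'), (1, 'instance', 'NOUN_PHRASE'), (2, ',', 'PUNCT'),
#            (3, 'consider', 'VERB'), (4, 'one simple phenomena', 'NOUN_PHRASE')]
```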
Output:
[(0, 'For', 'ADP'),
 (1, 'instance', 'NOUN_PHRASE'),
(2, ',', 'PUNCT'),
(3, 'consider', 'VERB'),
(4, 'one simple phenomena', 'NOUN_PHRASE'),
(7, ':', 'PUNCT'),
(8, 'a', 'DET'), ...
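As an aside, spaCy also ships a built-in way to collapse a span into a single token: doc.retokenize() (spaCy 2.x and later). The sketch below builds a Doc by hand so it runs without a pretrained model; in real use the spans would come from doc.noun_chunks instead:

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a Doc manually so no pretrained model is needed; with a model you
# would do doc = nlp(sentence) and take the spans from doc.noun_chunks.
words = ['For', 'instance', ',', 'consider', 'one', 'simple', 'phenomena']
doc = Doc(Vocab(), words=words)

# Pretend these spans came from doc.noun_chunks.
chunks = [doc[1:2], doc[4:7]]

# Merge each multi-token chunk into a single token in place; the merges
# are applied when the context manager exits.
with doc.retokenize() as retokenizer:
    for span in chunks:
        if len(span) > 1:
            retokenizer.merge(span)

print([t.text for t in doc])
# ['For', 'instance', ',', 'consider', 'one simple phenomena']
```

After merging, each former chunk is one token, so pairing t.text with t.pos_ (or a NOUN_CHUNKS label) is a single list comprehension over doc.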