Extract verb phrases using Spacy
I have been using Spacy to extract noun chunks with the Doc.noun_chunks attribute that Spacy provides.
How can I extract verb phrases (of the form 'VERB ? ADV * VERB +') from input text using the Spacy library?
This might help you.
from __future__ import unicode_literals
import spacy
import en_core_web_sm
import textacy

nlp = en_core_web_sm.load()
sentence = 'The author is writing a new book.'
pattern = r'<VERB>?<ADV>*<VERB>+'
# textacy.Doc and pos_regex_matches come from an older textacy API
doc = textacy.Doc(sentence, lang='en_core_web_sm')
matches = textacy.extract.pos_regex_matches(doc, pattern)
for match in matches:  # renamed to avoid shadowing the built-in 'list'
    print(match.text)
Output:
is writing
For how to highlight the verb phrases, check the link below.
Another approach:
I recently observed that Textacy has made some changes to its regex matching. Based on that approach, I tried it this way.
from __future__ import unicode_literals
import spacy
import en_core_web_sm
import textacy

nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The dog jumped into the water. The author is writing a book.'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]
doc = textacy.make_spacy_doc(sentence, lang='en_core_web_sm')
matches = textacy.extract.matches(doc, pattern)
for match in matches:
    print(match.text)
Output:
sat
jumped
writing
I checked the POS matching against this link, and the results do not seem to be what I expected (likely because newer models tag auxiliaries such as 'is' as AUX rather than VERB, so the pattern misses them).
https://explosion.ai/demos/matcher
Has anyone tried using POS tags rather than regex patterns to find verb phrases?
Edit 2:
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load('en_core_web_sm')
sentence = 'The cat sat on the mat. He quickly ran to the market. The dog jumped into the water. The author is writing a book.'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'AUX', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]
# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)
matcher.add("Verb phrase", None, pattern)  # spaCy v2 signature
doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
print(filter_spans(spans))
Output:
[sat, quickly ran, jumped, is writing]
Based on the help from mdmjsh's answer.
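As an aside, the matcher.add call above uses the spaCy v2 signature; in spaCy v3 the patterns are passed as a list and the callback argument is dropped. A minimal sketch of the same idea that runs without downloading a pretrained model, by building a Doc with hand-assigned POS tags (the sentence and tags here are my own illustration):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

nlp = spacy.blank('en')
# hand-assign POS tags so no pretrained model is needed
words = ['The', 'author', 'is', 'writing', 'a', 'book', '.']
pos = ['DET', 'NOUN', 'AUX', 'VERB', 'DET', 'NOUN', 'PUNCT']
doc = Doc(nlp.vocab, words=words, pos=pos)

pattern = [{'POS': 'AUX', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]
matcher = Matcher(nlp.vocab)
matcher.add('verb-phrase', [pattern])  # v3: patterns wrapped in a list
spans = [doc[start:end] for _, start, end in matcher(doc)]
print([span.text for span in filter_spans(spans)])  # ['is writing']
```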
Edit 3: strange behaviour. With the pattern below, the verb phrases of the following sentence are identified correctly at https://explosion.ai/demos/matcher
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]
'The black cat must be meowing really loudly in the garden.'
But running the code outputs the following:
[must, really meowing]
(Presumably the local model tags some of these tokens differently from the model behind the online demo - e.g. 'be' as AUX rather than VERB - which splits the phrase.)
The answers above refer to textacy, but all of this can be achieved directly in Spacy with the Matcher; no wrapper library is needed.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')  # download model first
sentence = 'The author was staring pensively as she wrote'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'OP': '*'},  # additional wildcard - match any text in between
           {'POS': 'VERB', 'OP': '+'}]
# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)
# Add pattern to matcher
matcher.add("verb-phrases", None, pattern)
doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)
N.b. this returns a list of tuples containing the match id and the start and end index of each match, e.g.:
[(15658055046270554203, 0, 4),
(15658055046270554203, 1, 4),
(15658055046270554203, 2, 4),
(15658055046270554203, 3, 4),
(15658055046270554203, 0, 8),
(15658055046270554203, 1, 8),
(15658055046270554203, 2, 8),
(15658055046270554203, 3, 8),
(15658055046270554203, 4, 8),
(15658055046270554203, 5, 8),
(15658055046270554203, 6, 8),
(15658055046270554203, 7, 8)]
You can convert these matches to spans using the indices:
spans = [doc[start:end] for _, start, end in matches]
# output
"""
The author was staring
author was staring
was staring
staring
The author was staring pensively as she wrote
author was staring pensively as she wrote
was staring pensively as she wrote
staring pensively as she wrote
pensively as she wrote
as she wrote
she wrote
wrote
"""
Note that I added the extra {'OP': '*'} to the pattern; with no specific POS/DEP annotation given, it acts as a wildcard (i.e. it matches any text). This is useful here because the question is about verb phrases: VERB, ADV, VERB is an unusual structure (try to think of some example sentences), but VERB, ADV, [other text], VERB is likely (as in the example sentence 'The author was staring pensively as she wrote'). Alternatively, you could refine the pattern to make it more specific (displacy is your friend here).
Further note that, because of the matcher's greediness, all permutations of the match are returned. You can optionally use filter_spans to remove duplicates and overlaps, reducing the result to the longest form.
from spacy.util import filter_spans
filter_spans(spans)
# output
[The author was staring pensively as she wrote]
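To make filter_spans's behaviour concrete, here is a minimal sketch (my own toy example, using a blank pipeline so no model download is needed): given overlapping candidate spans like those the greedy matcher returns, it keeps the longest span and drops everything that overlaps it.

```python
import spacy
from spacy.tokens import Doc
from spacy.util import filter_spans

nlp = spacy.blank('en')
doc = Doc(nlp.vocab, words=['was', 'staring', 'pensively'])
# overlapping candidates, as the greedy matcher would return them
spans = [doc[0:2], doc[1:2], doc[0:3]]
print([span.text for span in filter_spans(spans)])  # ['was staring pensively']
```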