Spacy Regex Phrase Matcher in Python
In a large corpus of text, I am interested in extracting every sentence which has a specific (verb, noun) or (adjective, noun) pair somewhere in it. I have a long list of such pairs, but here is a sample. In my MWE, I am trying to extract sentences with "write/wrote/writing/writes" and "book/s". I have about 30 such pairs of words.
Here is what I tried, but it doesn't catch most of the sentences:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
doc = nlp(u'Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.\
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')
pattern = [{"LEMMA": "write"},{"TEXT": {"REGEX": ".+"}},{"LEMMA": "book"}]
matcher.add("testy", None, pattern)
for sent in doc.sents:
    if matcher(nlp(sent.lemma_)):
        print(sent.text)
Unfortunately, I only get one match:
"While writing this book, he had to fend off aliens and dinosaurs."
However, I would also like to get the "He wrote his first book" sentence. The other write-book combinations have "writer" as the noun, so it is good that those do not match.
The problem is that in the Matcher, by default each dictionary in the pattern corresponds to exactly one token. So your regex isn't matching any number of characters, it's matching any one token, which isn't what you want. That is also why exactly one sentence matched: in "writing this book" there is a single token ("this") between the verb and the noun, while "wrote his first book" has two ("his" and "first").
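As a minimal sketch of that point (assuming spaCy 3.x and that en_core_web_sm lemmatizes these forms as expected), the REGEX dictionary buys you exactly a one-token gap:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# One dict per token: the verb, exactly one arbitrary token, the noun.
matcher.add("gap", [[{"LEMMA": "write"}, {"TEXT": {"REGEX": ".+"}}, {"LEMMA": "book"}]])

print(len(matcher(nlp("He was writing this book."))))  # 1: one token ("this") in the gap
print(len(matcher(nlp("He wrote his first book."))))   # 0: two tokens ("his", "first") in the gap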
To get what you want, you can use the OP value to specify that you want to match any number of tokens. See the operators or quantifiers section in the docs.
However, given your problem, you probably actually want to use the DependencyMatcher, so I rewrote your code to use that too. Try this:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
doc = nlp("""
Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.
While writing this book, he had to fend off aliens and dinosaurs. Greene's second book might not have been written by him.
Greene's cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.""")
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"},{"OP": "*"},{"LEMMA": "book"}]  # "OP": "*" matches any number of tokens in between
matcher.add("testy", [pattern])
print("----- Using Matcher -----")
for sent in doc.sents:
    if matcher(sent):
        print(sent.text)
print("----- Using Dependency Matcher -----")
deppattern = [
    # Anchor: any inflection of "write"
    {"RIGHT_ID": "wrote", "RIGHT_ATTRS": {"LEMMA": "write"}},
    # "book" must be a direct syntactic dependent of the "write" token
    {"LEFT_ID": "wrote", "REL_OP": ">", "RIGHT_ID": "book",
     "RIGHT_ATTRS": {"LEMMA": "book"}}
]
from spacy.matcher import DependencyMatcher
dmatcher = DependencyMatcher(nlp.vocab)
dmatcher.add("BOOK", [deppattern])
for _, (start, end) in dmatcher(doc):  # token ids come back in pattern order: (write, book)
    print(doc[start].sent)
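Since the question mentions about 30 such word pairs, here is a hedged sketch of one way to scale the DependencyMatcher approach: register one pattern per (verb, noun) pair. The pair list is illustrative, not from the original post:
pairs = [("write", "book"), ("read", "article")]  # hypothetical sample pairs

pair_matcher = DependencyMatcher(nlp.vocab)
for verb, noun in pairs:
    pair_matcher.add(f"{verb}_{noun}", [[
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": verb}},
        {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "noun",
         "RIGHT_ATTRS": {"LEMMA": noun}},
    ]])

# token_ids come back in pattern order, so token_ids[0] is the verb token.
for _, token_ids in pair_matcher(doc):
    print(doc[token_ids[0]].sent.text)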
One other, less important thing: the way you are calling the matcher is a bit odd. You can pass the matcher a Doc or a Span, but it should definitely be natural text. Calling .lemma_ on a sentence and creating a new doc from it works in your case, but in general it should be avoided.
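To make that concrete: pass the sentence Span straight into the matcher, as the rewritten loop above does, instead of round-tripping through the lemmatized string:
for sent in doc.sents:
    if matcher(sent):  # a Span of the already-parsed Doc: fine
        print(sent.text)
# rather than: if matcher(nlp(sent.lemma_)):  # re-parses unnatural text; avoid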