Deep copying PhraseMatcher object in spacy is not working
I want to use the multiprocessing module to do some phrase matching on documents in parallel. My idea was to create the PhraseMatcher object in one process and then share it across several processes by making copies of the PhraseMatcher object. The code seems to fail silently, without raising any error. To make things simpler, I tried this to demonstrate what I am trying to achieve:
import copy
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en')
color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('silk', 'yellow fabric')]
matcher = PhraseMatcher(nlp.vocab)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)
matcher2 = copy.deepcopy(matcher)
doc = nlp("yellow fabric")
matches = matcher2(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'COLOR'
    span = doc[start:end]                  # get the matched slice of the doc
    print(rule_id, span.text)
For the matcher2 object it gives no output, but with the matcher object I am able to get the results:
COLOR yellow
MATERIAL yellow fabric
I have been stuck on this for days. Any help would be greatly appreciated.
Thanks.
The root of the problem is that PhraseMatcher is a Cython class, defined and implemented in the file matcher.pyx, and Cython does not work properly with deepcopy.
Quoting the accepted answer of this Stack Overflow question:
Cython doesn't like deepcopy on Classes which have function/method referenced variables. Those variable copies will fail.
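The symptom can be reproduced without spaCy or Cython at all. Below is a minimal pure-Python sketch (the FakeMatcher class is invented here, not part of spaCy) of an object whose real state lives somewhere deepcopy cannot see, just as PhraseMatcher keeps its patterns in C-level structures. The copy is created without any error, yet it matches nothing:

```python
import copy

# Stand-in for C-level internal storage: the object's patterns live in a
# module-level dict keyed by id(), outside the object's __dict__, so
# copy.deepcopy never sees them.
_internal_state = {}

class FakeMatcher:
    def __init__(self):
        _internal_state[id(self)] = []

    def add(self, pattern):
        _internal_state[id(self)].append(pattern)

    def __call__(self, text):
        # Return every stored pattern that occurs in the text; a copy
        # whose id() has no entry silently matches nothing.
        return [p for p in _internal_state.get(id(self), []) if p in text]

m = FakeMatcher()
m.add('yellow')
print(m('yellow fabric'))    # the original finds its pattern

m2 = copy.deepcopy(m)        # no error is raised...
print(m2('yellow fabric'))   # ...but the copy's pattern store is empty
```

deepcopy rebuilds the object from its Python-visible attributes only, so any state held elsewhere (a C struct, an external table) is lost silently rather than raising.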
However, there are alternatives. If you want to run the PhraseMatcher on multiple documents in parallel, you can use multithreading via the PhraseMatcher's pipe method.
A possible workaround for your problem:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en_core_web_sm')
color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('silk', 'yellow fabric')]
matcher = PhraseMatcher(nlp.vocab)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)
doc1 = nlp('yellow fabric')
doc2 = nlp('red lipstick and big black boots')
for doc in matcher.pipe([doc1, doc2], n_threads=4):
    matches = matcher(doc)
    for match_id, start, end in matches:
        rule_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(rule_id, span.text)
Hope this helps!