Custom pattern to match phrases in spaCy's Matcher
I'm trying to use spaCy to match some example sentences. I got the sample code working, but now I need something more specific. Here is the sample code first, to make things easier to follow:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])
doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
This works fine, but now I need something more specific: I need Python to load phrases from a file (each sentence is different) and keep them in memory, then check whether phrase1 ("Hello, world! Hello world!" in the example) contains any of the patterns held in memory. Is this possible? If so, could someone help or point me in the right direction? I really don't know how to proceed.
Thanks a lot!!
If I understand correctly, you want to:
- Read an external file containing the strings to match, which in your case is
Hello, world!
- Find your pattern in the loaded file.
- Return the pattern above.
This should work:
# File contents:
"""./myfile.txt
This is one sentence. Hello world! This is another sentence.
Yet another sentence. Hello world... Hello, world!
"""
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])
# Load file as string into memory: https://java2blog.com/python-read-file-into-string/
with open('myfile.txt') as f:
    doc = nlp(f.read())
# Use the pipeline's sentence recognizer: https://spacy.io/usage/linguistic-features#sbd
for sent in doc.sents:
    matches = matcher(sent)
    # From your code, just replace `doc` with `sent`
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = sent[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)
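If all you need is a yes/no check per phrase (does this sentence contain any of the patterns?), a minimal sketch along these lines should also work; the helper name contains_pattern is just an illustration:
def contains_pattern(matcher, sent):
    # True if the Matcher finds at least one pattern in this sentence
    return len(matcher(sent)) > 0

for sent in doc.sents:
    if contains_pattern(matcher, sent):
        print("Contains a pattern:", sent.text)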
Note that if your file is very large, you may want to read it line by line instead, like this:
with open(file) as f:
    for line in f:
        # do your stuff here
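For instance, a minimal sketch of the line-by-line variant, reusing the same matcher as above (nlp.pipe simply streams the lines through the pipeline, which keeps memory usage low):
with open('myfile.txt') as f:
    for line_doc in nlp.pipe(line.strip() for line in f):
        for match_id, start, end in matcher(line_doc):
            string_id = nlp.vocab.strings[match_id]  # Get string representation
            print(string_id, line_doc[start:end].text)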