Custom pattern to match phrases in spaCy's Matcher

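I am trying to use spaCy to match some example sentences. I got the sample code working, but now I need something more specific. Here is the sample code first, to make things easier to follow: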

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

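This works fine, but now I need something more specific: I need Python to load phrases from a file (each sentence is different) and keep them in memory, then check whether phrase1 (Hello, world! Hello world! in the example) contains any of the patterns held in memory. Is this possible? If so, could someone help or point me in the right direction? I really don't know how to proceed. Thanks a lot!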

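If I understand correctly, you want to:

  1. Read an external file containing the text to match against, in your case Hello, world!
  2. Look for your pattern in the loaded file.
  3. Return the matches for that pattern.

This should work: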

# File contents:
"""./myfile.txt
This is one sentence. Hello world! This is another sentence.
Yet another sentence. Hello world... Hello, world!
"""

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

# The matcher is run per sentence further down, once `doc` has been created

# Load file as string into memory: https://java2blog.com/python-read-file-into-string/
with open('myfile.txt') as f:
    doc = nlp(f.read())

# Use the pipeline's sentence recognizer: https://spacy.io/usage/linguistic-features#sbd
for sent in doc.sents:
    matches = matcher(sent)
    # From your code, just replace `doc` by `sent`
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = sent[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)
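
The question can also be read the other way round: the phrases live in the file, and a given sentence is checked against them. Below is a minimal sketch of that reading, assuming spaCy v3 and a hypothetical phrases.txt with one phrase per line. It uses spaCy's PhraseMatcher with attr="LOWER" for case-insensitive matching. The helper names build_phrase_matcher and find_phrases are made up for illustration, and spacy.blank("en") is used only because tokenization is all this approach needs.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.matcher import PhraseMatcher


def build_phrase_matcher(nlp, phrase_file):
    """Build a case-insensitive PhraseMatcher from a file with one phrase per line."""
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    with open(phrase_file, encoding="utf-8") as f:
        phrases = [line.strip() for line in f if line.strip()]
    # make_doc only runs the tokenizer, which is enough for the LOWER attribute
    matcher.add("FilePhrases", [nlp.make_doc(p) for p in phrases])
    return matcher


def find_phrases(nlp, matcher, text):
    """Return the text of every span in `text` that matches a loaded phrase."""
    doc = nlp.make_doc(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


# Demo with a throwaway phrase file (hypothetical contents)
nlp = spacy.blank("en")
phrase_path = Path(tempfile.mkdtemp()) / "phrases.txt"
phrase_path.write_text("hello, world\nhello world\n", encoding="utf-8")

matcher = build_phrase_matcher(nlp, phrase_path)
found = find_phrases(nlp, matcher, "Hello, world! Hello world!")
print(found)
```

PhraseMatcher matches exact token sequences, so it suits fixed phrases loaded from disk; token-level rules such as IS_PUNCT still need the rule-based Matcher shown above.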

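Note that if your file is too large, you may want to read it line by line, like so:

with open("myfile.txt") as f:
    for line in f:
        line_doc = nlp(line)  # process one line at a time
        for match_id, start, end in matcher(line_doc):
            print(nlp.vocab.strings[match_id], line_doc[start:end].text)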
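
For very large files, a middle ground between reading everything and processing one line at a time is to read the file lazily in small batches. Here is a rough sketch using only the standard library; iter_batches and the batch size are illustrative choices, not part of spaCy. Each batch is a plain list of strings, so it could be handed to nlp.pipe(batch) to process several lines at once.

```python
import tempfile
from itertools import islice


def iter_batches(path, batch_size=1000):
    """Yield lists of up to `batch_size` non-empty lines without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        lines = (line.strip() for line in f if line.strip())
        while True:
            batch = list(islice(lines, batch_size))
            if not batch:
                break
            yield batch


# Demo with a small temporary file; the blank line is skipped
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("one\ntwo\n\nthree\nfour\nfive\n")

batches = list(iter_batches(tmp.name, batch_size=2))
print(batches)  # [['one', 'two'], ['three', 'four'], ['five']]
```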