spaCy Matcher: detect the first word of a sentence if the pattern is matched

Question

My spaCy Matcher shows unexpected behaviour and I don't understand why. Consider the following toy data:

# %% Load packages
import pandas as pd
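import spacy

# Load the German pipeline here, because nlp.pipe() is called on the toy data below
nlp = spacy.load("de_core_news_lg")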

# %% toy data
df = \
    pd.DataFrame(columns=['date',         'ground_cat',   'ground_word',    'sentence'],
                  data = [["2009-09-01",  "a",            'wutschäumend'  , "Wutschäumend bin ich."], # in this line, "Wutschäumend" should match
                          ["2009-09-01",  "a",            'wutschäumend',   "Ich bin Wutschäumend."],
                          ["2009-09-01",  "neg_a",        'wutschäumend'  , "Ich bin nicht wutschäumend."],
                          ["2009-09-01",  "b",            'zweifelhaftes' , "Peter hat ein zweifelhaftes Verständnis von Gerechtigkeit."],
                          ["2009-09-01",  "c",            'unsittlich',     "Das ist unsittlich."],
                          ["2009-09-01",  "d",            'unsolidarisch' , "Niemand ist so unsolidarisch wie er."]])

df['processed_sentence'] = [doc for doc in nlp.pipe(df['sentence'].tolist())]

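The ground_x columns hold the ground truth: for example, in the first row, category a should be matched by finding the word wutschäumend, and so on.

I now prepare the matcher and set up the patterns. Basically, I want the words in matching_dict to match either if they are at the start of a sentence, or if they appear somewhere else in the sentence and are not directly preceded by one of the negation words.

These are the patterns: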

# %% Prepare Matcher
nlp = spacy.load("de_core_news_lg")
matcher = spacy.matcher.Matcher(nlp.vocab)  # instantiate Matcher

negations = ["nicht", "nichts", "kein", "keine", "keinen", "keinem"] # negation words

matching_dict: dict = {"a": ['wutschäumend'],
                       "b": ['zweifelhaftes'],
                       "c": ["unsittlich"],
                       "d": ["unsolidarisch"]}

# patterns for non-negated words associated with each emotion
a = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['a']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['a']}}]]
b = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['b']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['b']}}]]
c = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['c']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['c']}}]]
d = [[{"IS_SENT_START": True}, {"LOWER": {"IN": matching_dict['d']}}], [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": matching_dict['d']}}]]

matcher.add(201, a)
matcher.add(202, b)
matcher.add(203, c)
matcher.add(204, d)

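Now, when I apply this to the toy data, the first sentence should match but does not, and I can't figure out what is wrong with my patterns. Can someone point out my mistake?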

df['matches'] = df['processed_sentence'].apply(matcher)  # match patterns

df['matches']
# 0               [] # should be [(201, 0, 2)]!
# 1    [(201, 1, 3)]
# 2               []
# 3    [(202, 2, 4)]
# 4    [(203, 1, 3)]
# 5    [(204, 2, 4)]
# Name: matches, dtype: object

Thanks in advance!

Answer 1

Let's look at the patterns. Remember that each dictionary describes exactly one token.

a = [
    [{"IS_SENT_START": True}, 
     {"LOWER": {"IN": matching_dict['a']}}], 
    [{"LOWER": {"NOT_IN": negations}}, 
     {"LOWER": {"IN": matching_dict['a']}}]]

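There are two patterns here.

In the first one, you have the first token of a sentence, followed by a second token that is in your a list.

In the second one, you have a token that is not a negation, followed by a token that is in your a list.

So your first pattern does not match a list word that is itself the first token of the sentence, which is what you want it to do. Because each dictionary corresponds to one token, both conditions have to go into the same dictionary, so the pattern should look like this: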
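To make the two-token reading concrete, here is a minimal sketch that isolates the original first pattern. It assumes spaCy v3 and, so that it runs without downloading a model, swaps `de_core_news_lg` for a blank German pipeline with a `sentencizer`; the key `"a_two_tokens"`, the helper `spans` and the sentence "Sehr wutschäumend bin ich." are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank German pipeline plus a rule-based sentencizer stands in
# for de_core_news_lg, so sentence starts are set without a trained model.
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
# Two dictionaries = two tokens: a sentence-initial token, THEN the list word.
matcher.add(
    "a_two_tokens",
    [[{"IS_SENT_START": True}, {"LOWER": {"IN": ["wutschäumend"]}}]],
)

def spans(text):
    """Return only the (start, end) token offsets of all matches (hypothetical helper)."""
    return [(start, end) for _, start, end in matcher(nlp(text))]

# The list word is the first token, so there is no token in front of it to be
# the sentence start.
first_token_hit = spans("Wutschäumend bin ich.")
# The list word is the second token, right after the sentence-initial token.
second_token_hit = spans("Sehr wutschäumend bin ich.")

print(first_token_hit, second_token_hit)
```

Under these assumptions the pattern only fires when the list word is the second token of the sentence, never when it is the first.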

[{"IS_SENT_START": True, "LOWER": {"IN": matching_dict['a']}}]
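As a rough end-to-end check, the sketch below combines the single-token pattern with the unchanged negation pattern and runs it on the first three toy sentences. It assumes spaCy v3 and again uses a blank German pipeline with a `sentencizer` in place of `de_core_news_lg`; the string key `"a"` (instead of 201), the helper `find_spans` and the variable `results` are illustrative names, not part of the original code.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank pipeline + sentencizer instead of de_core_news_lg,
# so the example runs without a downloaded model.
nlp = spacy.blank("de")
nlp.add_pipe("sentencizer")

negations = ["nicht", "nichts", "kein", "keine", "keinen", "keinem"]
words_a = ["wutschäumend"]

matcher = Matcher(nlp.vocab)
matcher.add(
    "a",
    [
        # One token that is BOTH sentence-initial and in the word list.
        [{"IS_SENT_START": True, "LOWER": {"IN": words_a}}],
        # A non-negation token followed by a token from the word list.
        [{"LOWER": {"NOT_IN": negations}}, {"LOWER": {"IN": words_a}}],
    ],
)

def find_spans(text):
    """Return the (start, end) token offsets of all matches (hypothetical helper)."""
    return [(start, end) for _, start, end in matcher(nlp(text))]

sentences = [
    "Wutschäumend bin ich.",
    "Ich bin Wutschäumend.",
    "Ich bin nicht wutschäumend.",
]
results = [find_spans(s) for s in sentences]
for sentence, found in zip(sentences, results):
    print(sentence, found)
```

Note that the match for the first sentence covers a single token, so its span is (0, 1) rather than the (0, 2) expected in the question. The negated sentence still produces no match, because the token before the list word is "nicht".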

