SpaCy, how to create a pattern to match a string caught via SpeechRecognition?

Question

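This is my first time asking for help here, so I hope everything is clear! Background: I'm building an application for a role-playing game (GURPS) that tracks the damage players deal to enemies. The application itself is finished, and I'm using PySimpleGUI for the graphical interface. The next step is to integrate voice commands, so that input comes from speech rather than the keyboard (there are several inputs, so why not?). I'm therefore using the SpeechRecognition library to capture the voice input and store what the user said in a string variable. I'm now working on the second part: extracting the inputs from that string. The last part will be storing those inputs in a dictionary and passing them to my functions.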

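What I'm trying to achieve: I'm having a lot of trouble designing the matching with spaCy. Since I don't think there is any dataset I could use to train an NN or ML model for my task, I'm using rule-based matching. This means every sentence has to be structured in a specific way so that the tokens can be extracted the way I want. An example of such a sentence is: "You hit the enemy zombie one, that has vulnerability 2, to the head, with a large piercing attack, dealing 8 damage."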

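The inputs I have to extract are the following:

enemy hit: zombie one (the enemy that was hit; the DataFrame I create may contain zombie_1, zombie_2, and so on. In general, when there are multiple zombies, they are numbered in order. I'm still working out whether to name them zombie1, zombie2, ...)
vulnerability: the number (2 in this example)
location hit: the head in this case, but it could be "right arm", which I can't extract because tokenization treats it as 2 tokens instead of 1
large piercing: the type of attack (the simplest cases are "cutting" or "crushing", a single word that is easy to pick up, but I haven't found any way to extract multi-word types like this one because of how tokenization works)
damage 8: the damage dealt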

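Problems: I'm currently using the DependencyMatcher. The main problems are:

Because tokenization works on single words, in the cases above I lose part of the phrase (for "right arm" only "arm" is extracted; for "large piercing" only "piercing" is extracted).
I can't generalize my pattern, and I'm not sure the DependencyMatcher is the right tool here. I'm working in Italian, but I'm testing in English for simplicity. My current English script is: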

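import spacy
from spacy.matcher import DependencyMatcher

string = "You hit the enemy zombie one, that has vulnerability 2, to the head, with a large piercing attack, dealing 8 damage."


def extract_inputs(string):  # placeholder name; the original def line was not part of the post
    nlp = spacy.load("en_core_web_sm")  # English model (the Italian one is it_core_news_sm)
    doc = nlp(string)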
    # for token in doc:
    #     print(token.text, token.dep_)
    
    # I'm creating 2 lists with all the words for body locations and attack types, so the tokens can be
    # found via the "LOWER" or "LEMMA" attribute (first part of each list is English, second part Italian).
    # "LOWER" compares against the lowercased token text, so the entries are lowercase as well.

    body_list_words = ["body", "head", "arm_right", "arm_left", "leg_right", "leg_left", "hand_right", "hand_left", "foot_right", "foot_left",
                       "groin", "skull", "vitals", "neck", "corpo", "testa", "braccio destro", "braccio sinistro", "gamba destra", "gamba sinistra",
                       "mano destra", "mano sinistra", "piede destro", "piede sinistro", "testicoli", "cranio", "vitali", "collo"]

    attack_type_words = ["cutting", "impaling", "crushing", "small penetration", "penetration", "big penetration", "huge penetration",
                          "burning", "explosive", "tagliente", "impalamento", "schiacciamento", "penetrazione minore", "piccola penetrazione",
                          "penetrazione", "penetrazione maggiore", "enorme penetrazione", "infuocati", "esplosivi"]


    ###############
    # Find the matches
    ##############
    matcher = DependencyMatcher(nlp.vocab)
    # I'm starting finding the verb
    patterns = [{"RIGHT_ID": "anchor_verbo",
                 "RIGHT_ATTRS": {"POS": "VERB"}},
    
    # Looking for Obj (word: enemy); the Italian model labels it "obj", the English model "dobj"
                {"LEFT_ID": "anchor_verbo",
                 "REL_OP": ">",
                 "RIGHT_ID": "obj_verbo",
                 "RIGHT_ATTRS": {"DEP": {"IN": ["obj", "dobj"]}}},

    # Looking for the name of the enemy: zombie1
                {"LEFT_ID": "obj_verbo",
                 "REL_OP": ">",
                 "RIGHT_ID": "type_enemy",
                 "RIGHT_ATTRS": {"DEP": "nmod"}},
    
     # Looking for word: vulnerability
                {"LEFT_ID": "anchor_verbo",
                 "REL_OP": ">",
                 "RIGHT_ID": "vulnerability",
                 "RIGHT_ATTRS": {"LEMMA": "vulnerability"}},

    #Looking for number associated to Vulnerability
                {"LEFT_ID": "vulnerability",
                 "REL_OP": ">",
                 "RIGHT_ID": "num_vulnerability",
                 "RIGHT_ATTRS": {"DEP": "nummod"}},

    #location of body hit
                {"LEFT_ID": "anchor_verbo",
                 "REL_OP": ">>",
                 "RIGHT_ID": "location",
                 "RIGHT_ATTRS": {"LOWER": {"IN": body_list_words}}},

   # Looking for word: attack, in order to find the type of attack
                {"LEFT_ID": "anchor_verbo",
                 "REL_OP": ">>",
                 "RIGHT_ID": "attack",
                 "RIGHT_ATTRS": {"POS": "NOUN"}},

    #Looking for type of attack
                {"LEFT_ID": "attack",
                 "REL_OP": ">>",
                 "RIGHT_ID": "type_attack",
                 "RIGHT_ATTRS": {"LEMMA": {"IN": attack_type_words}}},

    #Looking for word: damage in order to extract the number
                {"LEFT_ID": "attack",
                 "REL_OP": ">>",
                 "RIGHT_ID": "word_damage",
                 "RIGHT_ATTRS": {"DEP": "nmod"}},

    # Looking for the number
                {"LEFT_ID": "word_damage",
                 "REL_OP": ">>",
                 "RIGHT_ID": "num_damage",
                 "RIGHT_ATTRS": {"DEP": "nummod"}}

                ]

    matcher.add("Inputs1", [patterns])
    matches = matcher(doc)

    if not matches:
        # Nothing in the sentence fits the pattern
        return {}

    match_id, token_ids = matches[0]
    # token_ids[i] is the token matched by patterns[i]
    matched_words = [doc[token_id].text for token_id in token_ids]

    #########
    # Now I'm creating the dictionary, dropping the first element (the anchor verb)
    #########
    del matched_words[0]
    print(matched_words)

    input_dict = {matched_words[0]: matched_words[1], "location": matched_words[4], matched_words[5]: matched_words[6],
                  matched_words[7]: matched_words[8], matched_words[2]: matched_words[3]}

    #print(input_dict)
    return input_dict

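General problem to solve: any multi-word expression that should be grouped together (such as "right arm", "left leg", or "large penetration") can't be extracted this way (only "arm", "leg", or "penetration" is returned).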

Can you help me? Thanks!

Answer 1

To summarize your problem: you're getting single words, but you want to capture multiple words as a unit, like "right arm".

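You can do this with the DependencyMatcher, but it takes some work. Basically, you want to match the whole subtree of the single word you're getting now. In the phrase "right arm", "arm" is the head noun and "right" depends on "arm". All the words that depend on "arm", directly or indirectly (through other words), make up its "subtree".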
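As a rough sketch of that idea (not something taken from your script), the snippet below matches the head word with the DependencyMatcher and then expands it to its subtree. The parse is annotated by hand so that it runs without a trained model; the sentence, the head indices, the dependency labels, the short location list, and the helper subtree_text are all assumptions made for illustration, and a real parser may label things differently.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# A blank English pipeline only supplies the vocab; no trained model is loaded.
nlp = spacy.blank("en")

# Hand-annotated parse (illustrative values, not parser output).
words = ["You", "hit", "the", "zombie", "on", "the", "right", "arm"]
pos = ["PRON", "VERB", "DET", "NOUN", "ADP", "DET", "ADJ", "NOUN"]
heads = [1, 1, 3, 1, 1, 7, 7, 4]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "dobj", "prep", "det", "amod", "pobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},
    # Any descendant of the verb whose lowercased text is a known location.
    {"LEFT_ID": "verb", "REL_OP": ">>", "RIGHT_ID": "location",
     "RIGHT_ATTRS": {"LOWER": {"IN": ["head", "arm", "leg"]}}},
]
matcher.add("LOCATION", [pattern])


def subtree_text(token, skip_deps=("det",)):
    """Join a token with everything that depends on it, in document order."""
    return " ".join(t.text for t in token.subtree if t.dep_ not in skip_deps)


match_id, token_ids = matcher(doc)[0]
location = doc[token_ids[1]]   # token matched by the "location" node
print(location.text)           # arm
print(subtree_text(location))  # right arm
```

Skipping the "det" label keeps "the" out of the result; which labels to drop is a judgment call that depends on what you want in the phrase.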

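Understanding dependency relations is a little complicated but very powerful. I recommend reading chapter 14 of the Jurafsky and Martin book, which is a straightforward guide to dependency parsing. Feel free to skim much of it.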

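That said, for the kind of phrases you're after, there is a simpler approach you can try in spaCy. Try the merge_noun_chunks function, which merges each noun chunk into a single token that is easier to work with.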
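Here is a minimal sketch of what merge_noun_chunks does, again on a hand-annotated Doc so it runs without downloading a model. The sentence and its annotations are assumptions for illustration; with a trained pipeline the parser would supply them.

```python
import spacy
from spacy.pipeline.functions import merge_noun_chunks
from spacy.tokens import Doc

# The blank English pipeline provides the English noun-chunk rules via its vocab.
nlp = spacy.blank("en")

# Hand-annotated parse (illustrative values, not parser output).
words = ["You", "hit", "the", "zombie", "on", "the", "right", "arm"]
pos = ["PRON", "VERB", "DET", "NOUN", "ADP", "DET", "ADJ", "NOUN"]
heads = [1, 1, 3, 1, 1, 7, 7, 4]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "dobj", "prep", "det", "amod", "pobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

chunks = [chunk.text for chunk in doc.noun_chunks]
print(chunks)  # ['You', 'the zombie', 'the right arm']

# Merge every noun chunk into one token, in place.
doc = merge_noun_chunks(doc)
merged = [token.text for token in doc]
print(merged)  # ['You', 'hit', 'the zombie', 'on', 'the right arm']
```

With a trained pipeline you would normally register it as a component instead, with nlp.add_pipe("merge_noun_chunks"), so it runs after the parser. Note that the determiner ends up inside the merged token, so a pattern would see "the right arm" rather than "right arm".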

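Noun chunks are somewhat hard to define, and the way they work in spaCy may not be what you want, but if you like you can look at the source code and write your own definition. To make that work, you do need to understand dependency parsing.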
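To give a feel for what "your own definition" could look like, here is one possible sketch that merges a noun only with the adjective or compound modifiers directly to its left and leaves determiners alone. The function merge_modified_nouns, the choice of labels, and the hand-annotated example are all hypothetical; this is not how spaCy defines noun chunks.

```python
import spacy
from spacy.tokens import Doc
from spacy.util import filter_spans

nlp = spacy.blank("en")


def merge_modified_nouns(doc, modifier_deps=("amod", "compound")):
    """Merge each noun with the contiguous direct modifiers on its left."""
    spans = []
    for token in doc:
        if token.pos_ != "NOUN":
            continue
        start = token.i
        # Walk left while the previous token directly modifies this noun.
        while (start > 0
               and doc[start - 1].head.i == token.i
               and doc[start - 1].dep_ in modifier_deps):
            start -= 1
        if start < token.i:
            spans.append(doc[start:token.i + 1])
    with doc.retokenize() as retokenizer:
        # filter_spans drops overlapping candidates, keeping the longest.
        for span in filter_spans(spans):
            retokenizer.merge(span)
    return doc


# Hand-annotated parse (illustrative values, not parser output).
words = ["with", "a", "large", "piercing", "attack"]
pos = ["ADP", "DET", "ADJ", "ADJ", "NOUN"]
heads = [0, 4, 4, 4, 0]  # absolute index of each token's head
deps = ["ROOT", "det", "amod", "amod", "pobj"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

tokens = [t.text for t in merge_modified_nouns(doc)]
print(tokens)  # ['with', 'a', 'large piercing attack']
```

How well this works on real input depends on how the parser tags words like "piercing", so it is worth printing token.pos_ and token.dep_ for a few of your sentences before settling on the labels.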

python

speech-to-text

spacy