How does the spaCy lemmatizer work?

For lemmatization, spaCy has lists of words: adjectives, adverbs, verbs... and also lists for exceptions: adverbs_irreg... For the regular ones there is a set of rules.

Let's take the word "wider" as an example.

Since it is an adjective, the rule for lemmatization should be taken from this list:

ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
] 

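As I understand it, the process goes like this:

1) Get the POS tag of the word to know whether it is a noun, a verb...
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.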

Now, how is it decided to use "er" -> "e" instead of "er" -> "" to get "wide" and not "wid"?

It can be tested here.

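TL;DR: spaCy checks whether the lemma it is trying to generate is in the list of known words or exceptions for that part of speech.

Long answer:

Check out the lemmatizer.py file, specifically the lemmatize function at the bottom.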

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

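For English adjectives, for instance, it takes in the string we're evaluating, the index of known adjectives, and the exceptions and rules, as you've referenced, from this directory (for the English model).

The first thing we do in lemmatize after making the string lowercase is check whether the string is in our list of known exceptions, which includes lemma rules for words like "worse" -> "bad".

Then we go through the rules and apply each one to the string if it is applicable. For the word wider, we would apply the following rules: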

["er", ""],
["est", ""],
["er", "e"],
["est", "e"]

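Only the two "er" rules match wider, so we would output the following forms: ["wid", "wide"].

Then we check whether each form is in our index of known adjectives. If it is, we append it to forms. Otherwise, we add it to oov_forms, which I'm guessing is short for "out of vocabulary". wide is in the index, so it gets added to forms. wid gets added to oov_forms.

Lastly, we return a set of either the lemmas found, or any lemmas that matched rules but weren't in our index, or just the word itself.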
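As a minimal sketch of the walk-through above, the function can be run against toy data. The three-word index and the one-entry exception table are made-up stand-ins for spaCy's much larger tables; only the function body and the adjective rules come from the code quoted in this post.

```python
def lemmatize(string, index, exceptions, rules):
    # Same logic as the function quoted above.
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)


ADJECTIVE_RULES = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]

# Toy stand-ins (assumptions for illustration, not spaCy's real tables).
toy_index = {"wide", "bad", "cold"}
toy_exceptions = {"worse": ["bad"]}

# "wid" is rejected (not in the index), "wide" is accepted.
wider = lemmatize("wider", toy_index, toy_exceptions, ADJECTIVE_RULES)
# "cold" is accepted, "colde" is rejected.
coldest = lemmatize("coldest", toy_index, toy_exceptions, ADJECTIVE_RULES)
# No rule matches "worse"; the lemma comes from the exception table.
worse = lemmatize("worse", toy_index, toy_exceptions, ADJECTIVE_RULES)
print(wider, coldest, worse)
```

With these toy tables, the index lookup is what separates "wide" from "wid": both candidates are generated, but only one of them is a known adjective.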

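The word-lemmatize link you posted above works for wider, because wide is in the word index. Try something like He is blandier than I. spaCy will mark blandier (a word I made up) as an adjective, but it's not in the index, so it will just return blandier as the lemma.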

Each word type (adjective, noun, verb, adverb) has a set of rules and a set of known words. The mapping happens here:

INDEX = {
    "adj": ADJECTIVES,
    "adv": ADVERBS,
    "noun": NOUNS,
    "verb": VERBS
}


EXC = {
    "adj": ADJECTIVES_IRREG,
    "adv": ADVERBS_IRREG,
    "noun": NOUNS_IRREG,
    "verb": VERBS_IRREG
}


RULES = {
    "adj": ADJECTIVE_RULES,
    "noun": NOUN_RULES,
    "verb": VERB_RULES,
    "punct": PUNCT_RULES
}

Then, on this line in lemmatizer.py, the correct index, rules and exc (exc, I believe, stands for exceptions, e.g. irregular examples) get loaded:

lemmas = lemmatize(string, self.index.get(univ_pos, {}),
                   self.exc.get(univ_pos, {}),
                   self.rules.get(univ_pos, []))
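A small sketch of what those three .get() calls do, assuming tables shaped like the INDEX, EXC and RULES mappings above. The table contents are made-up placeholders, and select_tables is a hypothetical helper, not part of spaCy. The point is the defaults: a part of speech that is missing from one mapping gets an empty table instead of raising a KeyError.

```python
# Toy tables with the same keys as the mappings above.
# The contents are placeholders, not spaCy's data.
INDEX = {"adj": {"wide"}, "adv": {"well"}, "noun": {"cat"}, "verb": {"run"}}
EXC = {"adj": {"worse": ["bad"]}, "adv": {}, "noun": {}, "verb": {}}
RULES = {
    "adj": [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]],
    "noun": [["s", ""]],
    "verb": [["ing", ""]],
    "punct": [["\u2019", "'"]],
}


def select_tables(univ_pos):
    """Hypothetical helper: pick the three tables for one part of speech."""
    return (
        INDEX.get(univ_pos, {}),
        EXC.get(univ_pos, {}),
        RULES.get(univ_pos, []),
    )


# "adv" has an index and exceptions but no entry in RULES.
adv_index, adv_exc, adv_rules = select_tables("adv")
# "punct" has rules but no entry in INDEX or EXC.
punct_index, punct_exc, punct_rules = select_tables("punct")
print(adv_rules, punct_index, punct_rules)
```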

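All the remaining logic is in the function lemmatize and is surprisingly short. We perform the following operations:

  1. If there is an exception (i.e. the word is irregular) that includes the provided string, use it and add it to the lemmatized forms.
  2. For each rule, in the order they are given for the selected word type, check whether it matches the given word. If it does, try to apply it.

    2a. If, after applying the rule, the word is in the list of known words (i.e. the index), add it to the lemmatized forms of the word.

    2b. Otherwise, add the word to a separate list called oov_forms (here I believe oov stands for "out of vocabulary").

  3. If we found at least one form using the steps above, we return the list of forms found; otherwise we return the oov_forms list, or the original string if that list is empty too.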
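The steps above can be sketched with a variant of the function that also returns its two intermediate lists, which makes the fallback order visible. lemmatize_traced is a hypothetical name, and the index and rules are toy data chosen for illustration.

```python
def lemmatize_traced(string, index, exceptions, rules):
    """Hypothetical variant of lemmatize that also exposes its two lists."""
    string = string.lower()
    forms = list(exceptions.get(string, []))      # step 1
    oov_forms = []
    for old, new in rules:                        # step 2
        if not string.endswith(old):
            continue
        form = string[:len(string) - len(old)] + new
        if not form:
            continue
        if form in index or not form.isalpha():   # step 2a
            forms.append(form)
        else:                                     # step 2b
            oov_forms.append(form)
    # Step 3: prefer known forms, then OOV forms, then the input itself.
    result = set(forms or oov_forms or [string])
    return forms, oov_forms, result


rules = [["er", ""], ["er", "e"]]
index = {"wide"}  # toy index

known = lemmatize_traced("wider", index, {}, rules)
made_up = lemmatize_traced("zorper", index, {}, rules)   # invented word
no_match = lemmatize_traced("blue", index, {}, rules)
print(known, made_up, no_match, sep="\n")
```

In this sketch, a word whose suffix matches a rule but whose candidates are all unknown ends up with the OOV candidates, while a word that matches no rule at all comes back unchanged.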

Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

Class

It starts off by initializing 3 variables:

class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules

Now, looking at self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py where it's loading files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer

Why doesn't spaCy just read a file?

Most probably because declaring the strings in code is faster than streaming them through I/O.


Where do these index, exceptions and rules come from?

Looking at it closely, they all seem to come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html

Rules

Looking even closer, the rules in https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules from nltk https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749

And these rules originally come from the Morphy software https://wordnet.princeton.edu/man/morphy.7WN.html

Additionally, spaCy includes some punctuation rules that do not come from Princeton Morphy:

PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]
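A short sketch of how these punctuation rules would pass through the lemmatize function quoted in this post. It assumes empty index and exception tables for punctuation, which is an assumption about the call, not something confirmed here. A quote character is never alphabetic, so the "not form.isalpha()" branch accepts the replacement even though nothing is in the index.

```python
PUNCT_RULES = [["“", "\""], ["”", "\""], ["\u2018", "'"], ["\u2019", "'"]]


def lemmatize(string, index, exceptions, rules):
    # Same logic as the function quoted in this post.
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)


# Empty index and exceptions are assumed for punctuation.
left_double = lemmatize("“", {}, {}, PUNCT_RULES)
right_single = lemmatize("\u2019", {}, {}, PUNCT_RULES)
plain = lemmatize(",", {}, {}, PUNCT_RULES)  # no rule matches a comma
print(left_double, right_single, plain)
```

The effect is a normalisation of curly quotes to their ASCII counterparts; other punctuation falls through to the final "return the string itself" step.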

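Exceptions

As for the exceptions, they are stored in the *_irreg.py files in spaCy, and they look like they also come from the Princeton WordNet.

This is evident if we look at some mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc); and if you download the wordnet package from nltk, we see that it's the same list: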

alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
adv.exc       data.adj     data.verb  index.noun   lexnames    README
citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc 
1490 adj.exc
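A sketch of how such an .exc file maps onto the dictionary shape that lemmatize expects for its exceptions argument. It assumes the WordNet exception-list layout, where each line holds an inflected form followed by one or more base forms separated by spaces. parse_exc is a hypothetical helper, and the sample lines are illustrative, not copied from the file.

```python
def parse_exc(text):
    """Parse WordNet-style exception lines into {inflected: [base, ...]}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        inflected, bases = fields[0], fields[1:]
        table.setdefault(inflected, []).extend(bases)
    return table


# Illustrative sample in the assumed "inflected base [base ...]" layout.
sample = "worse bad\nworst bad\n"
exceptions = parse_exc(sample)
print(exceptions)
```

The resulting mapping supports the exceptions.get(string, []) lookup used at the top of lemmatize.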

Index

If we look at the spaCy lemmatizer's index, we see that it also comes from WordNet, e.g. compare https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py with the re-distributed copy of wordnet in nltk:

alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj 

  1 This software and database is being provided to you, the LICENSEE, by  
  2 Princeton University under the following license.  By obtaining, using  
  3 and/or copying this software and database, you agree that you have  
  4 read, understood, and will comply with these terms and conditions.:  
  5   
  6 Permission to use, copy, modify and distribute this software and  
  7 database and its documentation for any purpose and without fee or  
  8 royalty is hereby granted, provided that you agree to comply with  
  9 the following copyright notice and statements, including the disclaimer,  
  10 and that the same appear on ALL copies of the software, database and  
  11 documentation, including modifications that you make for internal  
  12 use or for distribution.  
  13   
  14 WordNet 3.0 Copyright 2006 by Princeton University.  All rights reserved.  
  15   
  16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
  17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
  18 IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
  19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
  20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
  21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
  22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
  23 OTHER RIGHTS.  
  24   
  25 The name of Princeton University or Princeton may not be used in  
  26 advertising or publicity pertaining to distribution of the software  
  27 and/or database.  Title to copyright in this software, database and  
  28 any associated documentation shall at all times remain with  
  29 Princeton University and LICENSEE agrees to preserve same.  
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"  
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"  
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"  
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"  
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex  
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base  
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part  
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part  
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 |  being born or beginning; "the nascent chicks"; "a nascent insurgency"   
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"  
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels  
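To see how a flat list of adjectives could be pulled out of a file like this, here is a sketch that extracts the words from one data line. It assumes the field layout described in the WordNet wndb documentation linked above: synset offset, lexicographer file number, synset type, a two-digit hexadecimal word count, then word and lex_id pairs. synset_words is a hypothetical helper, and the gloss in the sample line is shortened.

```python
def synset_words(line):
    """Return the words of one WordNet data-file line, or [] for header lines."""
    if line.startswith("  "):
        return []  # license header lines begin with two spaces
    fields = line.split()
    word_count = int(fields[3], 16)  # w_cnt is a two-digit hexadecimal number
    # Words sit at fields[4], fields[6], ...; each is followed by its lex_id.
    return [fields[4 + 2 * i] for i in range(word_count)]


line = ("00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 "
        ";c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 "
        "| especially of muscles")
header = "  1 This software and database is being provided to you, the LICENSEE, by  "

words = synset_words(line)
skipped = synset_words(header)
print(words, skipped)
```

Collecting these words over every line of data.adj would give a set that can serve as the adjective index in the sense used by lemmatize.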

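Given that the dictionary, exceptions and rules that the spaCy lemmatizer uses come largely from Princeton WordNet and its Morphy software, we can move on to the actual implementation of how spaCy applies the rules using the index and exceptions.

We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

The main action comes from the function rather than the Lemmatizer class: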

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    # TODO: Is this correct? See discussion in Issue #435.
    #if string in index:
    #    forms.append(string)
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

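Why is the lemmatize method outside of the Lemmatizer class?

That I'm not exactly sure of. Perhaps it's to ensure that the lemmatization function can be called outside of a class instance, but given that @staticmethod and @classmethod exist, perhaps there are other considerations as to why the function and the class have been decoupled.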

Morphy vs Spacy

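Comparing the spaCy lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/ and was created more than a decade ago), the main processes in morphy(), Oliver Steele's Python port of the WordNet morphy, are:

  1. Check the exception lists
  2. Apply the rules once to the input to get y1, y2, y3, etc.
  3. Return all that are in the database (and check the original too)
  4. If there are no matches, keep applying the rules until we find a match
  5. Return an empty list if we can't find anything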
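The five steps above can be sketched as follows. This is a simplified illustration, not nltk's actual code: morphy_sketch is a hypothetical name, and a plain set of known words stands in for the WordNet database.

```python
def morphy_sketch(form, known, exceptions, rules):
    """Simplified illustration of the five morphy steps listed above."""

    def apply_rules(forms):
        return [f[:-len(old)] + new
                for f in forms
                for old, new in rules
                if f.endswith(old)]

    def filter_forms(forms):
        # Keep known words only, in order, without duplicates.
        seen, result = set(), []
        for f in forms:
            if f in known and f not in seen:
                seen.add(f)
                result.append(f)
        return result

    # 1. Check the exception list.
    if form in exceptions:
        return filter_forms([form] + exceptions[form])
    # 2. Apply the rules once.
    forms = apply_rules([form])
    # 3. Return everything that is in the database, checking the original too.
    results = filter_forms([form] + forms)
    if results:
        return results
    # 4. Keep applying the rules until something matches.
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results
    # 5. Nothing found.
    return []


rules = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
known = {"wide", "bad", "cold"}   # toy stand-in for the database
exceptions = {"worse": ["bad"]}

regular = morphy_sketch("wider", known, exceptions, rules)
irregular = morphy_sketch("worse", known, exceptions, rules)
unknown = morphy_sketch("blandier", known, exceptions, rules)
print(regular, irregular, unknown)
```

Note the difference in the last case: this sketch returns an empty list for an unknown word and leaves it to the caller to decide what to do with that.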

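For spaCy, it is possibly still under development, given the TODO at line https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76

But the general process seems to be:

  1. Look for the exceptions; if the word is in the exception list, get the lemmas from there.
  2. Apply the rules.
  3. Save the ones that are in the index lists.
  4. If there is no lemma from steps 1-3, just keep track of the out-of-vocabulary (OOV) words and also append the original string to the lemma forms.
  5. Return the lemma forms.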

In terms of OOV handling, spaCy returns the original string if no lemmatized form is found. In that respect, the nltk implementation of morphy does the same, e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'

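Checking for infinitive before lemmatization

Possibly another point of difference is how morphy and spaCy decide what POS to assign to the word. In that respect, spaCy puts some linguistic rules in the Lemmatizer() to decide whether a word is the base form, and skips the lemmatization entirely if the word is already in the infinitive form (is_base_form()). This saves quite a bit of work if lemmatization is to be done for all words in a corpus and a sizeable chunk of them are infinitives (already the lemma form).

But that is possible in spaCy because it allows the lemmatizer to access the POS, which is tied closely to some morphological rules. For morphy, although it is possible to figure out some morphology using the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitive.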
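A rough sketch of the shortcut described above, assuming the morphology is available as a dictionary of features. The function names, the "VerbForm" key and its "inf" value are assumptions made for illustration; this is not spaCy's actual is_base_form() implementation.

```python
def is_base_form_sketch(univ_pos, morphology=None):
    """Hypothetical, simplified check; not spaCy's actual implementation."""
    morphology = morphology or {}
    # An infinitive verb is already its own lemma.
    return univ_pos == "verb" and morphology.get("VerbForm") == "inf"


def lemmatize_with_shortcut(string, univ_pos, morphology, slow_lemmatize):
    """Skip the rule-based lookup when the word is already a base form."""
    if is_base_form_sketch(univ_pos, morphology):
        return {string.lower()}
    return slow_lemmatize(string)


calls = []


def slow_lemmatize(string):
    # Stand-in for the rule-based lemmatize; records when it is invoked.
    calls.append(string)
    return {"run"}


infinitive = lemmatize_with_shortcut("run", "verb", {"VerbForm": "inf"}, slow_lemmatize)
participle = lemmatize_with_shortcut("running", "verb", {"VerbForm": "part"}, slow_lemmatize)
print(infinitive, participle, calls)
```

Only the participle reaches the slower path in this sketch, which is the saving the paragraph above refers to.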

Generally, the 3 primary signals of morphological features need to be teased out of the POS tag:

  • person
  • number
  • gender

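Updated

spaCy did make changes to the lemmatizer after the initial answer (12 May 2017). I think the purpose was to make lemmatization faster, without look-ups and rule processing.

So they pre-lemmatize words and leave them in a lookup hash table, so that retrieval is O(1) for words that have already been lemmatized https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py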
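The lookup approach can be sketched as a single dictionary access. The table entries and the lookup_lemma helper are made up for illustration, and falling back to the input string for unknown words is an assumption about how a miss would be handled.

```python
# Toy lookup table (made-up entries; a real table would be far larger).
LOOKUP = {"wider": "wide", "worse": "bad", "mice": "mouse"}


def lookup_lemma(string, table=LOOKUP):
    """Hypothetical helper: one dict access, falling back to the input."""
    return table.get(string, string)


known = lookup_lemma("wider")
unknown = lookup_lemma("blandier")
print(known, unknown)
```

Compared with the rule-based function, no suffix matching or index check happens at query time; the cost is that the table has to contain every inflected form in advance.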

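Also, in an effort to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92

But the underlying lemmatization steps discussed above are still relevant to the current spaCy version (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc)


Epilogue

I guess now that we know it works with linguistic rules and all, the other question is "are there any non rule-based methods for lemmatization?"

But before even answering that question, "What exactly is a lemma?" might be the better question to ask.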