Prevent spaCy tokenizer from splitting on a specific character

Question

When tokenizing a sentence with spaCy, I want it not to split into separate tokens on the / character.

Example:

import en_core_web_lg
nlp = en_core_web_lg.load()
for i in nlp("Get 10ct/liter off when using our App"):
    print(i)

Output:

Get
10ct
/
liter
off
when
using
our
App

I want the output to be Get, 10ct/liter, off, when, ...

I was able to find out how to add more splitting rules to spaCy's tokenizer, but not how to disable a specific one.

Answer 1

I suggest using a custom tokenizer; see "Modifying existing rule sets" in the spaCy documentation:

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_trf")
text = "Get 10ct/liter off when using our App"
# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp(text)
print([t.text for t in doc])
## =>  ['Get', '10ct/liter', 'off', 'when', 'using', 'our', 'App']

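Note the commented-out #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), line: I simply removed the / character from the [:<>=/] character class. The original rule splits at a / that sits between a letter or digit on the left and a letter on the right.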
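To see why dropping / from the character class stops the split, the two patterns can be compared outside spaCy. This is a minimal sketch using only Python's re module; it assumes a simplified ALPHA of ASCII letters (spaCy's real ALPHA covers many more scripts), and the names ALPHA_SIMPLE, with_slash and without_slash are made up for illustration.

```python
import re

# Assumption: a simplified stand-in for spaCy's ALPHA character class
# (ASCII letters only; the real class covers many more scripts).
ALPHA_SIMPLE = "a-zA-Z"

# The default-style rule, with "/" in the character class.
with_slash = re.compile(r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA_SIMPLE))
# The modified rule from the answer, with "/" removed.
without_slash = re.compile(r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA_SIMPLE))

token = "10ct/liter"
matches_with = [m.group() for m in with_slash.finditer(token)]
matches_without = [m.group() for m in without_slash.finditer(token)]

print(matches_with)     # ['/']  -> the tokenizer would split here
print(matches_without)  # []     -> no infix match, the token stays whole
```

An infix match is what tells the tokenizer where to cut a token, so a pattern that no longer matches / leaves 10ct/liter intact.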

If you still need '12/ct' to be split into three tokens, add another line below the r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA) line:

r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA),
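The effect of this extra rule can be checked in isolation. The sketch below again uses plain re with an ASCII-only approximation of ALPHA (an assumption, as are the names ALPHA_SIMPLE, digit_slash and slash_positions); it shows that the rule fires only when the / directly follows a digit.

```python
import re

# Assumption: ASCII-only approximation of spaCy's ALPHA character class.
ALPHA_SIMPLE = "a-zA-Z"

# Split on "/" only when a digit precedes it and a letter follows it.
digit_slash = re.compile(r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA_SIMPLE))

def slash_positions(token):
    """Return the character offsets where the rule would split the token."""
    return [m.start() for m in digit_slash.finditer(token)]

print(slash_positions("12/ct"))       # [2] -> split into '12', '/', 'ct'
print(slash_positions("10ct/liter"))  # []  -> a letter precedes "/", no split
```

Because the lookbehind only accepts digits, 10ct/liter stays a single token while 12/ct is still separated.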
