Problems using Spacy tokenizer with special characters

Question

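I'm new to spaCy and I'm trying to find some patterns in a text, but I'm having trouble because of the way tokenization works. For example, I created the following pattern to try to find percentage elements such as "0,42%" using the Matcher (it's not exactly what I want, but I'm just practicing for now):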

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("pt_core_news_sm")

matcher = Matcher(nlp.vocab)

text = 'total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% '

pattern_test =  [{"TEXT": {"REGEX": "[0-9]+[,.]+[0-9]+[%]"}}]  

text_ = nlp(text)

matcher.add("pattern test", [pattern_test] )
result = matcher(text_)

for id_, beg, end in result:
    print(id_)
    print(text_[beg:end])

The problem is that it returns the results below, because the tokenizer treats each of these as a single token:

9844711491635719110
1,80%:(comex
9844711491635719110
0,50%/ativo

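I tried using Python's .replace() method on the string to replace the special characters with spaces before tokenizing it, but now when I print the tokenization result, everything gets separated like this: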

text_adjustment = text.replace(":", " ").replace("(", " ").replace(")", " ").replace("/", " ").replace(";", " ").replace("-", " ").replace("+", " ")

print([token for token in text_adjustment])

['t', 'o', 't', 'a', 'l', ' ', ' ', '1', ',', '8', '0', '%', ' ', ' ', 'c', 'o', 'm', 'e', 'x', ' ', '1', ',', '3', '0', '%', ' ', ' ', ' ', 'd', 'e', 'r', 'i', 'v', ' ', '0', ',', '5', '0', '%', ' ', 'a', 't', 'i', 'v', 'o', ' ', ' ', '1', ',', '1', '7', '%', ' ']

I would like the tokenization result to look like this:

['total', '1,80%', 'comex', '1,30%', 'deriv', '0,50%', 'ativo', '1,17%']

Is there a better way to do this? I'm using the 'pt_core_news_sm' model, but I can change the language if needed.

Thanks in advance :)

Answer 1

I suggest using

import re
#...
text = re.sub(r'(\S)([/:()])', r'\1 \2', text)
pattern_test =  [{"TEXT": {"REGEX": r"^\d+[,.]\d+$"}}, {"ORTH": "%"}]

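Here, the (\S)([/:()]) regex matches any non-whitespace character (capturing it into Group 1) followed by a /, :, ( or ) (capturing it into Group 2), and re.sub then inserts a space between the two groups.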
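To see what this substitution does on its own, here is a minimal sketch that runs only the re.sub step on the text from the question, without spaCy. The variable name spaced is just for illustration.

```python
import re

text = 'total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% '

# Put a space between a non-whitespace char (group 1)
# and a following /, :, ( or ) (group 2).
spaced = re.sub(r'(\S)([/:()])', r'\1 \2', text)

print(repr(spaced))
# 'total : 1,80% :(comex 1,30% + deriv 0,50% /ativo : 1,17% '
```

Note that the ( in ":(comex" stays attached, because the preceding : was already consumed by the previous match. The percentages are now separated from the punctuation that followed them, which is what matters for the pattern.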

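The ^\d+[,.]\d+$ regex matches a whole token text that consists of a float value, and % is the text of the next token (because the number and % are split into separate tokens by the tokenizer).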
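The anchors matter if the regex is applied with search-style semantics (a match anywhere in the token text), which is my understanding of how the Matcher's REGEX operator behaves; treat that as an assumption. The sketch below uses plain re on a few hypothetical token texts to show the difference between the anchored and unanchored versions.

```python
import re

unanchored = re.compile(r"\d+[,.]\d+")
anchored = re.compile(r"^\d+[,.]\d+$")

# Hypothetical token texts, not real tokenizer output.
tokens = ["1,80", "1,80%:(comex", "total", "0.5"]

# Map each token text to (unanchored hit, anchored hit).
results = {
    tok: (bool(unanchored.search(tok)), bool(anchored.search(tok)))
    for tok in tokens
}

for tok, (loose, strict) in results.items():
    print(f"{tok!r}: unanchored={loose}, anchored={strict}")
```

The unanchored pattern also hits "1,80%:(comex", which is why the pattern in the question returned those oversized tokens, while the anchored one only accepts tokens that are a number and nothing else.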

Full Python code snippet:

import spacy, re
from spacy.matcher import Matcher

#nlp = spacy.load("pt_core_news_sm")
nlp = spacy.load("en_core_web_trf")
matcher = Matcher(nlp.vocab)
text = 'total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% '
text = re.sub(r'(\S)([/:()])', r'\1 \2', text)
pattern_test =  [{"TEXT": {"REGEX": r"^\d+[,.]\d+$"}}, {"ORTH": "%"}]
text_ = nlp(text)

matcher.add("pattern test", [pattern_test] )
result = matcher(text_)

for id_, beg, end in result:
    print(id_)
    print(text_[beg:end])

Output:

9844711491635719110
1,80%
9844711491635719110
1,30%
9844711491635719110
0,50%
9844711491635719110
1,17%

Tags: python, nlp, tokenize, spacy