TypeError: Can't convert re.compile('[A-Z]+') (re.Pattern) to Union[str, tokenizers.Regex]

Question

I'm having trouble applying a regular expression to the Split() operation in the HuggingFace tokenizers library. The library documents the following input for Split():

pattern (str or Regex) – A pattern used to split the string. Usually a string or a Regex

In my code, I apply the Split() operation like this:

tokenizer.pre_tokenizer = Split(pattern="[A-Z]+", behavior='isolated')

But it doesn't work, because [A-Z]+ is interpreted as a literal string rather than as a regular expression. I also tried the following, to no avail:

pattern = re.compile("[A-Z]+")
tokenizer.pre_tokenizer = Split(pattern=pattern, behavior='isolated')

It raises the following error:

TypeError: Can't convert re.compile('[A-Z]+') (re.Pattern) to Union[str, tokenizers.Regex]

Answer 1

The following solution works by importing Regex from the tokenizers library and wrapping the pattern in it:

from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

tokenizer.pre_tokenizer = Split(pattern=Regex("[A-Z]+"),
                                behavior='isolated')
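
To see the difference without building a full tokenizer, here is a minimal sketch that runs the pre-tokenizer directly on a string. The sample text "helloWORLDfoo" and the variable names are made up for illustration. The last part assumes your pattern already exists as a compiled re.Pattern and passes its source string (the .pattern attribute) to Regex; that should only be relied on when the pattern uses syntax both regex engines share, as [A-Z]+ does.

```python
import re

from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# Hypothetical sample input, chosen only for illustration.
text = "helloWORLDfoo"

# A plain str pattern is matched literally, so the characters "[A-Z]+"
# would have to appear in the text for a split to happen.
literal_split = Split(pattern="[A-Z]+", behavior="isolated")

# Wrapping the same pattern in tokenizers.Regex makes it a regular expression.
regex_split = Split(pattern=Regex("[A-Z]+"), behavior="isolated")

# If the pattern is already a compiled re.Pattern, reuse its source string.
# Assumption: the pattern sticks to syntax both regex engines understand.
compiled = re.compile("[A-Z]+")
converted_split = Split(pattern=Regex(compiled.pattern), behavior="isolated")

# pre_tokenize_str returns (piece, (start, end)) tuples; keep only the pieces.
literal_pieces = [piece for piece, _ in literal_split.pre_tokenize_str(text)]
regex_pieces = [piece for piece, _ in regex_split.pre_tokenize_str(text)]
converted_pieces = [piece for piece, _ in converted_split.pre_tokenize_str(text)]

print(literal_pieces)
print(regex_pieces)
print(converted_pieces)
```

With the literal pattern the text comes back as a single piece, while both Regex variants isolate the run of capitals as its own piece.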
