Regex in Spark NLP Normalizer is not working correctly

I am preprocessing my data with a Spark NLP pipeline. The Normalizer removes not only the punctuation but also characters with diacritics (umlauts).

My code:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer

documentAssembler = DocumentAssembler() \
    .setInputCol("column") \
    .setOutputCol("column_document") \
    .setCleanupMode('shrink_full')

tokenizer = Tokenizer() \
  .setInputCols(["column_document"]) \
  .setOutputCol("column_token") \
  .setMinLength(2)\
  .setMaxLength(30)
  
normalizer = Normalizer() \
    .setInputCols(["column_token"]) \
    .setOutputCol("column_normalized") \
    .setCleanupPatterns(["[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
    .setLowercase(True)
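
For reference, a minimal sketch of how these stages could be chained and run on a DataFrame (this assumes the session is created with sparknlp.start() and that the input column is named "column", as above):

import sparknlp
from pyspark.ml import Pipeline

spark = sparknlp.start()  # SparkSession with Spark NLP loaded

# one-row DataFrame containing the example sentence
df = spark.createDataFrame(
    [("Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!",)],
    ["column"],
)

# chain the stages defined above and apply them
pipeline = Pipeline(stages=[documentAssembler, tokenizer, normalizer])
result = pipeline.fit(df).transform(df)

# show the normalized tokens
result.selectExpr("column_normalized.result").show(truncate=False)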

Example:

Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!

Output:

Ich esse gerne pfel vom Biobauernhof Reutter Mller die schmecken besonders gut

Expected output:

Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller die schmecken besonders gut

The \w pattern is not Unicode-aware by default; you have to enable that with a regex option. In this case, use the embedded flag option (?U):

The easiest way is to put it at the start of the pattern:
"(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"

More details from the documentation:

When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.

The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).

The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
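
Applied to the Normalizer from the question, only the cleanup pattern needs to change; a sketch (using a raw string so the backslashes reach the Java regex engine unchanged):

normalizer = Normalizer() \
    .setInputCols(["column_token"]) \
    .setOutputCol("column_normalized") \
    .setCleanupPatterns([r"(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
    .setLowercase(True)

With (?U) in front, \w also matches letters such as Ä and ü, so Äpfel and Reutter-Müller are preserved while the punctuation is still stripped.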