Regex in Spark NLP Normalizer is not working correctly
I am preprocessing my data with a Spark NLP pipeline. The Normalizer removes not only the punctuation but also the umlauts.
My code:
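from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer

# Wrap the raw text column in a document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("column") \
    .setOutputCol("column_document") \
    .setCleanupMode('shrink_full')
# Split documents into tokens of 2 to 30 characters
tokenizer = Tokenizer() \
    .setInputCols(["column_document"]) \
    .setOutputCol("column_token") \
    .setMinLength(2) \
    .setMaxLength(30)
# Remove everything matched by the cleanup pattern, then lowercase
normalizer = Normalizer() \
    .setInputCols(["column_token"]) \
    .setOutputCol("column_normalized") \
    .setCleanupPatterns([r"[^\w -]|_|-(?!\w)|(?<!\w)-"]) \
    .setLowercase(True)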
Example:
Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, die schmecken besonders gut!
Output:
Ich esse gerne pfel vom Biobauernhof Reutter Mller die schmecken besonders gut
Expected output:
Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller die schmecken besonders gut
The \w pattern is not Unicode-aware by default, so you need a regex option to make it Unicode-aware. In this case, it is easiest to use the embedded flag option (?U):
"(?U)[^\w -]|_|-(?!\w)|(?<!\w)-"
More details from the documentation of Java's UNICODE_CHARACTER_CLASS flag:
When this flag is specified then the (US-ASCII only) Predefined character classes and POSIX character classes are in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties.
The UNICODE_CHARACTER_CLASS mode can also be enabled via the embedded flag expression (?U).
The flag implies UNICODE_CASE, that is, it enables Unicode-aware case folding.
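The effect can be reproduced outside Spark with a minimal sketch in plain Python. This is only an analogy, not what Spark NLP runs internally: Python's `re` module treats `\w` as Unicode-aware by default for `str` patterns, so `re.ASCII` is used here to imitate the ASCII-only default described above, and leaving the flag off stands in for `(?U)`. The `clean` helper is a hypothetical name introduced for illustration, and the pattern is applied to the whole sentence rather than per token, so details such as hyphen handling may differ from the pipeline output.

```python
import re

# Cleanup pattern from the question, unchanged.
PATTERN = r"[^\w -]|_|-(?!\w)|(?<!\w)-"

SENTENCE = ("Ich esse gerne Äpfel vom Biobauernhof Reutter-Müller, "
            "die schmecken besonders gut!")


def clean(text: str, unicode_aware: bool) -> str:
    # Hypothetical helper: delete every match of PATTERN from text.
    # re.ASCII limits \w to [a-zA-Z0-9_], imitating the ASCII-only default;
    # without it, Python's \w also matches letters such as Ä and ü,
    # which is the behaviour (?U) turns on in the answer above.
    flags = 0 if unicode_aware else re.ASCII
    return re.sub(PATTERN, "", text, flags=flags)


# ASCII-only \w: the umlauts count as non-word characters and are removed.
print(clean(SENTENCE, unicode_aware=False))
# Unicode-aware \w: only the comma and the exclamation mark are removed.
print(clean(SENTENCE, unicode_aware=True))
```

With ASCII-only matching the umlauts fall into `[^\w -]` and are stripped along with the punctuation; with Unicode-aware matching they count as word characters and survive.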