Can XQuery regex match a null character?
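I want to remove all NULL characters from a string. I know the regular expression that should match them is \x00, and I tried the following XQuery:
replace($message, '\x00', '')
This results in an error:
exerr:ERROR Conversion from XPath2 to Java regular expression syntax failed: Error at character 1 in regular expression \x00: invalid escape sequence
Is there a quick solution or workaround for this? I am using eXist-db 2.2.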
Short version: you can't, at least not within the scope of the XQuery and XML specifications. There may be an eXist-db proprietary way that I am not aware of (such as calling Java regular expression functions natively from XQuery, which seems to be possible in eXist-db), but I would not consider that a "quick solution or workaround".
Skimming through the XPath and XQuery Functions and Operators 3.0 specification, which also contains the definition of regular expressions for XQuery 3.0, I found no specified way of escaping characters by their Unicode code point. The \x00 syntax is specific to some regular expression implementations. regular-expressions.info confirms this:
XML regular expressions don't have any tokens like \xFF or \uFFFF to match particular (non-printable) characters. You have to add them as literal characters to your regex. If you are entering the regex into an XML file using a plain text editor, then you can use the &#xFFFF; XML syntax. Otherwise, you'll need to paste in the characters from a character map.
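To illustrate that \x00 is an escape belonging to particular regex flavors rather than a universal one, here is a minimal sketch in Python, whose re module happens to accept it. Python is used only because it can hold a NUL in a string; the sample value of message is made up, and none of this is eXist-db or XQuery API.

```python
import re

# Hypothetical sample value. In a normal (non-raw) Python string literal,
# "\x00" is an actual NUL character.
message = "abc\x00def\x00"

# In a raw string, r"\x00" is the four characters backslash, x, 0, 0.
# Python's regex flavor interprets that sequence as a hex escape for NUL;
# the XPath/XQuery regex flavor has no such escape and rejects it.
cleaned = re.sub(r"\x00", "", message)
```

The same pattern string is what eXist-db rejected with "invalid escape sequence", because the XPath regex grammar is checked before the pattern is translated to Java syntax.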
With this in mind, there would be two options:
1. Use an XML character reference (entity) to represent the null byte. This is not possible either, because the XML specification does not allow control characters, by definition in Extensible Markup Language (XML) 1.0 (Fifth Edition):
CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
plus the additional restriction of allowed characters in the same specification:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
XML 1.1 extends this definition to control characters, covering all of them except the null byte:
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Finally, XQuery relies on the same specification for its allowed characters:
Char ::= [http://www.w3.org/TR/REC-xml#NT-Char]
2. Include the null byte directly in the XQuery document. Apart from practical problems (null bytes in files often cause all kinds of unexpected trouble), the same character limitations as defined above apply (a well-formed XML document must contain only the characters defined above):
document ::= ( prolog element Misc* ) - ( Char* RestrictedChar Char* )
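The two Char productions quoted above can be written out as plain range checks, which makes it easy to see that code point 0 falls outside both. This is only a sketch in Python; the function names is_xml10_char and is_xml11_char are made up for illustration and do not come from any library.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.1 Char production."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
```

For example, is_xml11_char(0x1) is true while is_xml10_char(0x1) is false, and both functions return false for 0x0.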
This is discussed at length in Why are “control” characters illegal in XML 1.0?
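Basically, the answer is that you can't have any NUL (x00) characters in your string. XML, and therefore the XDM data model, does not allow them. So if they appear in your input, you are already outside the scope of the standards.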
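Since the NUL characters cannot legally exist once the data is inside XQuery, one option is to strip them before the data ever reaches the processor. A minimal sketch in Python, assuming the input arrives as raw bytes in a known encoding; the function name strip_nul and its parameters are hypothetical, and how the cleaned string is then handed to eXist-db is left out.

```python
def strip_nul(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw input and drop every U+0000 character.

    Decoding happens first on purpose: in encodings such as UTF-16,
    zero bytes are part of ordinary characters, so deleting them at
    the byte level would corrupt the text.
    """
    text = raw.decode(encoding)
    return text.replace("\x00", "")
```

This only removes NUL. Other control characters that XML 1.0 forbids would still need separate handling.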