VB.NET 2010: Matching Java multiline comments with Regex
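I want to remove (Java/C/C++/..) multiline comments from a file. For this I wrote a regular expression:
/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/
This regex works in Notepad++ and Geany (search and replace all with nothing). The regex behaves differently in VB.NET.
I am using:
Microsoft Visual Studio 2010 (Version 10.0.40219.1 SP1Rel)
Microsoft .NET Framework (4.7.02053 SP1Rel)
The files I run the replacement on are not complicated. I do not need to handle any quoted text that might start or end a comment.
@sln Thank you for your detailed reply. I will also quickly explain my regex, as you did: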
/\*                     Find the beginning of the comment.
[^\*]*                  Match any chars, but not an asterisk.
                        We need to deal with finding an asterisk now:
(\*+[^\*/][^\*]*)*      This regex breaks down to:
    \*+                 Consume asterisk(s).
    [^\*/]              Match any other char that is not an asterisk or a / (would end the comment!).
    [^\*]*              Match any other chars that are not asterisks.
    ( )*                Try to find more asterisks followed by other chars.
\*+/                    Match 1 to n asterisks and finish the comment with /.
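To make the breakdown above easier to follow, here is a minimal sketch that writes the same pattern in verbose mode, one commented part per line. It uses Python's `re` module rather than .NET, so it only illustrates the structure of the pattern; the names `BLOCK_COMMENT`, `COMPACT`, and `sample` are made up for this illustration, and the inner group is written as non-capturing, which does not change what is matched.

```python
import re

# The pattern from the question, laid out with re.VERBOSE so that every
# part can carry its own explanation.
BLOCK_COMMENT = re.compile(r"""
    /\*          # beginning of the comment
    [^*]*        # any chars, but not an asterisk
    (?:          # now deal with asterisks inside the comment body:
        \*+      #   consume asterisk(s)
        [^*/]    #   one char that is neither * nor / (a / would end the comment)
        [^*]*    #   any further chars that are not asterisks
    )*           # repeat for more asterisks followed by other chars
    \*+/         # 1 to n asterisks, then the closing /
""", re.VERBOSE)

# The compact form exactly as written in the question.
COMPACT = re.compile(r"/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")

# Hypothetical input: one comment with inner asterisks, one made of asterisks only.
sample = "a /* x * y ** z */ b /***/ c"

print(repr(BLOCK_COMMENT.sub("", sample)))  # 'a  b  c'
print(repr(COMPACT.sub("", sample)))        # 'a  b  c'
```

Both forms remove the same two comments, including the `/***/` case where the body consists only of asterisks.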
Here are two code snippets.
The first:
text
/*
* block comment
*
*/ /* comment1 */ /* comment2 */
My text to keep.
/* more comments */
more text
The second:
text
/*
* block comment
*
*/ /* comment1 *//* comment2 */
My text to keep.
/* more comments */
more text
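The only difference is the missing space between the two comments in
/* comment1 *//* comment2 */
Removing the found matches with Notepad++ and Geany works well for both cases. With the regex run from VB.NET, the second example fails. After the removal, the result for the second example looks like this: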
text
more text
But it should look like this:
text
My text to keep.
more text
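As a cross-check of the expected result, the sketch below applies the question's pattern to both snippets with Python's `re` module. This is a different engine from .NET, so it does not reproduce or explain the VB.NET behaviour; it only shows what the pattern is expected to leave behind. The snippets are typed in as they appear above (the exact whitespace of the original files is an assumption), and `kept_lines` is a helper invented for this example.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")

first = (
    "text\n"
    "/*\n"
    " * block comment\n"
    " *\n"
    " */ /* comment1 */ /* comment2 */\n"
    "My text to keep.\n"
    "/* more comments */\n"
    "more text\n"
)

# The second snippet differs only by the missing space between the two comments.
second = first.replace("/* comment1 */ /* comment2 */",
                       "/* comment1 *//* comment2 */")


def kept_lines(source):
    """Remove all block comments and return the non-blank lines that remain."""
    stripped = pattern.sub("", source)
    return [line.strip() for line in stripped.splitlines() if line.strip()]


print(kept_lines(first))   # ['text', 'My text to keep.', 'more text']
print(kept_lines(second))  # ['text', 'My text to keep.', 'more text']
```

In this engine, both snippets keep the line "My text to keep.", which matches what Notepad++ and Geany are reported to do.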
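I am using System.Text.RegularExpressions:
Imports System.Text.RegularExpressions   ' at the top of the file

' Read the whole file, then remove every match of the block-comment pattern.
Dim content As String = IO.File.ReadAllText(file_path_)
Dim multiline_comment_remover As Regex = New Regex("/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")
content = multiline_comment_remover.Replace(content, "")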
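I would like to get the same result with VB.NET as with Notepad++ and Geany. As sln answered, my regex "should work in a weird way". The question is why VB.NET cannot process this regex as expected. That question is still open.
Since sln's answer made my code work, I will accept it, although it does not explain why VB.NET does not like my regex. Thanks for your help! I learned a lot!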
I think you can use a generic C++ comment stripper.
Basically, it is:
Globally find the regex below, replace with $2 (capture group 2, the non-comment text).
Demo PCRE: https://regex101.com/r/UldYK5/1
Demo Python: https://regex101.com/r/avfSfB/1
 # raw: (?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
 # delimited: /(?m)((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:\/\*|\/\/))|[^\/"'\\\r\n]*))+|[^\/"'\\\r\n]+)+|[\S\s][^\/"'\\\r\n]*)/
 (?m)                             # Multi-line modifier
 (                                # (1 start), Comments
      (?:
           (?: ^ [ \t]* )?                  # <- To preserve formatting
           (?:
                /\*                              # Start /* .. */ comment
                [^*]* \*+
                (?: [^/*] [^*]* \*+ )*
                /                                # End /* .. */ comment
                (?:                              # <- To preserve formatting
                     [ \t]* \r? \n
                     (?=
                          [ \t]*
                          (?: \r? \n | /\* | // )
                     )
                )?
             |
                //                               # Start // comment
                (?:                              # Possible line-continuation
                     [^\\]
                  |  \\
                     (?: \r? \n )?
                )*?
                (?:                              # End // comment
                     \r? \n
                     (?=                              # <- To preserve formatting
                          [ \t]*
                          (?: \r? \n | /\* | // )
                     )
                  |  (?= \r? \n )
                )
           )
      )+                               # Grab multiple comment blocks if need be
 )                                # (1 end)
 |                                ## OR
 (                                # (2 start), Non - comments
                                  # Quotes
                                  # ======================
      (?:                              # Quote and Non-Comment blocks
           "
           [^"\\]*                          # Double quoted text
           (?: \\ [\S\s] [^"\\]* )*
           "
        |                                # --------------
           '
           [^'\\]*                          # Single quoted text
           (?: \\ [\S\s] [^'\\]* )*
           '
        |                                # --------------
           (?:                              # Qualified Linebreak's
                \r? \n
                (?:
                     (?=                              # If comment ahead just stop
                          (?: ^ [ \t]* )?
                          (?: /\* | // )
                     )
                  |                                # or,
                     [^/"'\\\r\n]*                    # Chars which don't start a comment, string, escape,
                                                      # or line continuation (escape + newline)
                )
           )+
        |                                # --------------
           [^/"'\\\r\n]+                    # Chars which don't start a comment, string, escape,
                                            # or line continuation (escape + newline)
      )+                               # Grab multiple instances
   |                                # or,
                                    # ======================
                                    # Pass through
      [\S\s]                           # Any other char
      [^/"'\\\r\n]*                    # Chars which don't start a comment, string, escape,
                                       # or line continuation (escape + newline)
 )                                # (2 end), Non - comments
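If you are using a particular engine that doesn't support assertions,
then you'll have to use this one.
This won't preserve formatting, though.
Usage is the same as above.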
 # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
 (                                # (1 start), Comments
      /\*                              # Start /* .. */ comment
      [^*]* \*+
      (?: [^/*] [^*]* \*+ )*
      /                                # End /* .. */ comment
   |
      //                               # Start // comment
      (?: [^\\] | \\ \n? )*?           # Possible line-continuation
      \n                               # End // comment
 )                                # (1 end)
 |
 (                                # (2 start), Non - comments
      "
      (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
      "
   |  '
      (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
      '
   |  [\S\s]                           # Any other char
      [^/"'\\]*                        # Chars which don't start a comment, string, escape,
                                       # or line continuation (escape + newline)
 )                                # (2 end)
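The "find globally, keep group 2" usage can be sketched as follows with the assertion-free pattern. This is a minimal illustration in Python (the language of one of the linked demos), not VB.NET; `STRIPPER`, `strip_comments`, and the sample input are names and data made up for this example. Group 1 matches comments and is dropped; group 2 matches quoted strings and other text and is written back unchanged.

```python
import re

# Assertion-free pattern from the answer: group 1 = comments, group 2 = non-comments.
STRIPPER = re.compile(
    r"""(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)"""
    r"""|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)"""
)


def strip_comments(source):
    """Replace every match with its group 2; comment matches have no group 2."""
    return STRIPPER.sub(lambda m: m.group(2) or "", source)


# Hypothetical input: a block comment, a string that merely looks like a
# comment, and a // comment at the end of the first line.
source = 'int a = 1; /* c1 */ char *s = "/* not a comment */"; // tail\nint b;\n'

print(repr(strip_comments(source)))
# 'int a = 1;  char *s = "/* not a comment */"; int b;\n'
```

The quoted string survives because it is consumed as group 2 before the comment alternative can see its contents. The `//` comment is removed together with its newline, which is the loss of formatting mentioned above.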