How to fix this regex for mentions and hashtags?
I am using a tool to build a valid regex for mentions and hashtags. I have managed to match what I want in the entered text, but I still need to solve the following matching problems.
Only match substrings that are delimited by whitespace on both sides. A valid hashtag or mention at the very beginning or at the very end of the string should also be matched.
The match itself should contain only the non-whitespace part: the surrounding spaces are part of the rule, not part of the matched substring.
The regex I am using is: (([@]{1}|[#]{1})[A-Za-z0-9]+)
Some examples of strings that should and should not match:
"@hello friend" - @hello must be matched as a mention.
"@ hello friend" - here there should be no matches.
"hey@hello @hello" - here only the last @hello must be matched as a mention.
"@hello! hi @hello #hi ##hello" - here only the second @hello and #hi must be matched as a mention and hashtag respectively.
Another example, from the image, where only "@word" should be a valid mention:
Update 16:35 (GMT-4) 3/15/18
I found a way to solve the problem by using the tool in PCRE (server) mode with a negative lookbehind and a negative lookahead:
(?<![^\s])(([@]{1}|[#]{1})[A-Za-z0-9]+)(?![^\s])
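These are the matches:
But now a question arises: will the negative lookahead and negative lookbehind work with regular expressions in C#? In JavaScript, for example, they do not work; as seen in the tool, the pattern gets underlined in red.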
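The lookaround pattern from the update can be checked against the four example strings from the question. The sketch below uses Python's re module only because it is easy to run; it is an assumption that this is an acceptable stand-in, and find_tags is a hypothetical helper name. .NET's regex engine also supports lookbehind and lookahead, so the same pattern should carry over to C# unchanged.

```python
import re

# The pattern from the update: a tag that is neither preceded nor followed
# by a non-whitespace character (so string edges count as boundaries too).
TAG = re.compile(r"(?<![^\s])(([@]{1}|[#]{1})[A-Za-z0-9]+)(?![^\s])")


def find_tags(text):
    """Return every mention/hashtag bounded by whitespace or the string edges."""
    # Group 1 is the whole tag, including its leading @ or #.
    return [m.group(1) for m in TAG.finditer(text)]


samples = [
    "@hello friend",
    "@ hello friend",
    "hey@hello @hello",
    "@hello! hi @hello #hi ##hello",
]
for sample in samples:
    print(repr(sample), "->", find_tags(sample))
```

Because the lookarounds do not consume the surrounding whitespace, the matched text needs no trimming.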
You can wrap your existing regex with start/end-of-line anchors, each alternated with whitespace.
^ - start
$ - end
\s - space
(^|\s+)(([@]{1}|[#]{1})[A-Za-z0-9]+)(\s+|$)
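One thing worth checking with the wrapped pattern is what happens when two tags are separated by a single space: the trailing (\s+|$) consumes that space, so the next tag no longer has whitespace in front of it. The sketch below, in Python for convenience, compares the pattern as written with a hypothetical variant (PEEKING, not part of the answer) that only peeks at the trailing whitespace with a lookahead.

```python
import re

# The pattern exactly as given above; group 2 holds the tag itself.
WRAPPED = re.compile(r"(^|\s+)(([@]{1}|[#]{1})[A-Za-z0-9]+)(\s+|$)")

# Hypothetical variant (an assumption, not from the answer): the trailing
# boundary is a lookahead, so the space stays available for the next match.
PEEKING = re.compile(r"(?:^|\s)([@#][A-Za-z0-9]+)(?=\s|$)")


def tags_wrapped(text):
    """Tags found by the wrapped pattern, read from capture group 2."""
    return [m.group(2) for m in WRAPPED.finditer(text)]


def tags_peeking(text):
    """Tags found by the lookahead variant, read from capture group 1."""
    return [m.group(1) for m in PEEKING.finditer(text)]


text = "@hello! hi @hello #hi ##hello"
print(tags_wrapped(text))  # the space after @hello is consumed, so #hi is skipped
print(tags_peeking(text))
```

On the question's last example the wrapped pattern returns only @hello, while the lookahead variant returns both @hello and #hi.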
Try this pattern:
(?:^|\s+)(?:(?<mention>@)|(?<hash>#))(?<item>\w+)(?=\s+)
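It breaks down as follows:
(?:
opens a non-capturing group
^|\s+
matches the start of the string or one or more whitespace characters
(?:
opens a non-capturing group
(?<mention>@)|(?<hash>#)
matches either @ or #, capturing it in a group named mention or hash respectively
(?<item>\w+)
matches one or more word characters (letters, digits, or underscore), captured as item so the tag text is easy to extract
(?=\s+)
a positive lookahead that requires whitespace to follow, without consuming it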
Fiddle: Live Demo
You will then need to use the host language to trim the returned match to remove any leading/trailing whitespace.
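As a rough illustration of how the named groups might be consumed, here is a Python sketch. Two assumptions to note: Python spells named groups (?P<name>...) rather than the (?<name>...) form that .NET accepts, and classify is a hypothetical helper name. Reading the item group directly avoids the trim step, since the leading whitespace sits outside that group.

```python
import re

# Same structure as the pattern above, with Python's (?P<name>...) syntax.
NAMED = re.compile(
    r"(?:^|\s+)(?:(?P<mention>@)|(?P<hash>#))(?P<item>\w+)(?=\s+)"
)


def classify(text):
    """Return (kind, item) pairs, where kind is 'mention' or 'hash'."""
    results = []
    for m in NAMED.finditer(text):
        # Exactly one of the two marker groups participates in each match.
        kind = "mention" if m.group("mention") else "hash"
        results.append((kind, m.group("item")))
    return results


print(classify("@hello friend #hi there"))
# The lookahead requires whitespace, so a tag at the very end is not found.
print(classify("hi @hello"))
```

Note that because the lookahead is (?=\s+), a tag that ends the string is not matched; allowing the end of the string as well would need something like (?=\s|$), which is a suggestion rather than part of the pattern above.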
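Update
Since you mentioned that you are using C#, I thought I would offer a .NET solution to your problem that does not need RegEx; I have not benchmarked it, but I would guess it is also faster than using RegEx.
Personally, my .NET flavor is Visual Basic, so this is a VB.NET solution, but you can easily run it through a converter, since I did not use anything that is unavailable in C#: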
Private Function FindTags(ByVal lead As Char, ByVal source As String) As String()
    Dim matches As List(Of String) = New List(Of String)
    Dim current_index As Integer = 0
    'Loop through all but the last character in the source
    For index As Integer = 0 To source.Length - 2
        'Reset the current index
        current_index = index
        'Check that the current character is the lead ("@" or "#"), that it is at the beginning of the String or follows whitespace, and that the next character is a letter or digit
        If source(index) = lead AndAlso (index = 0 OrElse Char.IsWhiteSpace(source, index - 1)) AndAlso Char.IsLetterOrDigit(source, index + 1) Then
            'Loop until the next character is no longer a letter or digit
            Do
                current_index += 1
            Loop While current_index + 1 < source.Length AndAlso Char.IsLetterOrDigit(source, current_index + 1)
            'Check if we're at the end of the String or the next character is whitespace
            If current_index = source.Length - 1 OrElse Char.IsWhiteSpace(source, current_index + 1) Then
                'Add the match to the collection
                matches.Add(source.Substring(index, current_index + 1 - index))
            End If
        End If
    Next
    Return matches.ToArray()
End Function
Fiddle: Live Demo
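For readers who want to try the scanning idea without a .NET toolchain, here is a loose Python port of the function above. It is a sketch, not a line-for-line translation: find_tags_scan is a hypothetical name, and str.isalnum and str.isspace are assumed to be close enough to Char.IsLetterOrDigit and Char.IsWhiteSpace for these examples.

```python
def find_tags_scan(lead, source):
    """Collect tags starting with `lead` that are bounded by whitespace or the string edges."""
    matches = []
    n = len(source)
    # The last character cannot start a tag, because a tag needs at least
    # one letter or digit after the lead character.
    for index in range(n - 1):
        starts_token = index == 0 or source[index - 1].isspace()
        if source[index] == lead and starts_token and source[index + 1].isalnum():
            # Advance `end` past the run of letters and digits.
            end = index + 1
            while end < n and source[end].isalnum():
                end += 1
            # Accept the tag only if it ends at the string end or at whitespace.
            if end == n or source[end].isspace():
                matches.append(source[index:end])
    return matches


text = "@hello! hi @hello #hi ##hello"
print(find_tags_scan("@", text))
print(find_tags_scan("#", text))
```

As in the original, the function handles one lead character per call, so mentions and hashtags are collected in two passes.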
This regex can do the job for you.
[@#][A-Za-z0-9]+\s|\s[@#][A-Za-z0-9]+
The | operator produces a logical "or", so you have two different expressions to match:
[@#][A-Za-z0-9]+\s
and
\s[@#][A-Za-z0-9]+
where
\s - space
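It may help to run this alternation against the question's examples before relying on it, because each branch checks whitespace on one side only. The Python sketch below (tags_either_side is a hypothetical helper) strips the captured whitespace and also prints where the match was found.

```python
import re

# The alternation exactly as given above: whitespace after OR whitespace before.
EITHER_SIDE = re.compile(r"[@#][A-Za-z0-9]+\s|\s[@#][A-Za-z0-9]+")


def tags_either_side(text):
    """Return the matches with the captured whitespace stripped off."""
    return [m.group(0).strip() for m in EITHER_SIDE.finditer(text)]


print(tags_either_side("@hello friend"))

# In "hey@hello @hello" the first branch matches the @hello glued to "hey"
# (characters 3 to 9), and the space it consumes hides the valid second tag.
glued = EITHER_SIDE.search("hey@hello @hello")
print(glued.span(), repr(glued.group(0)))
```

So for "hey@hello @hello" the pattern reports the tag the question wants excluded, and a string that consists of a single tag with no whitespace at all is not matched.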