Match star * character at end of word boundary \b

Question

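While building a lightweight tool to detect censored profanity, I noticed that detecting special characters at the end of a word boundary is very hard.

Using a tuple of strings, I build an OR'd word-boundary regex: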

import re

PHRASES = (
    r'sh\*t',  # easy
    r'sh\*\*',  # difficult
    r'f\*\*k',  # easy
    r'f\*\*\*',  # difficult
)

MATCHER = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES), 
    flags=re.IGNORECASE | re.UNICODE)

The problem is that * is not something that can be detected next to a word boundary \b.

print(MATCHER.search('Well f*** you!'))  # Fail - Does not find f***
print(MATCHER.search('Well f***!'))  # Fail - Does not find f***
print(MATCHER.search('f***'))  # Fail - Does not find f***
print(MATCHER.search('f*** this!'))  # Fail - Does not find f***
print(MATCHER.search('secret code is 123f***'))  # Pass - Should not match
print(MATCHER.search('f**k this!'))  # Pass - Should find

Any ideas on how to set this up conveniently so that it supports phrases ending in special characters?

Answer 1

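You can embed the boundary requirements in each string (written as raw strings, so that \b is not read as a backspace), like

r'\bsh\*t\b',
r'\bsh\*\*',
r'\bf\*\*k\b',
r'\bf\*\*\*',

and then use r"(%s)" % "|".join(PHRASES)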
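As a minimal sketch of the per-phrase approach, assuming the same four phrases as in the question, the strings can be joined and tried against the question's test inputs. The helper name `find` is hypothetical and only there to keep the output readable.

```python
import re

# Each phrase carries its own boundary requirements. A trailing \b is only
# added where the phrase ends in a word character.
PHRASES = (
    r'\bsh\*t\b',
    r'\bsh\*\*',
    r'\bf\*\*k\b',
    r'\bf\*\*\*',
)

MATCHER = re.compile(r"(%s)" % "|".join(PHRASES), flags=re.IGNORECASE)


def find(text):
    """Return the matched phrase, or None when nothing matches."""
    match = MATCHER.search(text)
    return match.group(1) if match else None


for text in ('Well f*** you!', 'Well f***!', 'f***',
             'secret code is 123f***', 'f**k this!'):
    print(repr(text), '->', find(text))
```

Note that phrases ending in * have no trailing check at all in this variant, so any character may follow them.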

Or, if the regex engine supports conditionals, it can be done like this

r'sh\*t',
r'sh\*\*',
r'f\*\*k',
r'f\*\*\*',

and then use r"(?(?=\w)\b)(%s)(?(?<=\w)\b)" % "|".join(PHRASES)
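Python's built-in re module only accepts conditionals that test a group number or name, not a lookaround, so the conditional pattern will not compile there as written. The sketch below is an illustrative rewrite, not the answer's own code: each conditional is spelled out as an alternation, "either a word character is adjacent and \b holds, or no word character is adjacent".

```python
import re

PHRASES = (
    r'sh\*t',
    r'sh\*\*',
    r'f\*\*k',
    r'f\*\*\*',
)

# (?(?=\w)\b)   is written as  (?:(?=\w)\b|(?!\w))
# (?(?<=\w)\b)  is written as  (?:(?<=\w)\b|(?<!\w))
MATCHER = re.compile(
    r"(?:(?=\w)\b|(?!\w))(%s)(?:(?<=\w)\b|(?<!\w))" % "|".join(PHRASES),
    flags=re.IGNORECASE)

for text in ('Well f*** you!', 'Well f***!', 'f**k this!',
             'f**king', 'secret code is 123f***'):
    print(repr(text), '->', MATCHER.search(text))
```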

Answer 2

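Use your knowledge of how each phrase starts and ends, and pair it with the corresponding matcher.
Here is a static version, but it is easy to sort new incoming phrases automatically by their start and end.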

import re

PHRASES1 = (
    r'sh\*t',  # easy
    r'f\*\*k',  # easy
)
PHRASES2 = (
    r'sh\*\*',  # difficult
    r'f\*\*\*',  # difficult
)
PHRASES3 = (
    r'\*\*\*hole',
)
PHRASES4 = (
    r'\*\*\*sonofa\*\*\*\*\*',  # easy
)
MATCHER1 = re.compile(
    r"\b(%s)\b" % "|".join(PHRASES1),
    flags=re.IGNORECASE | re.UNICODE)
# Inside [...] both $ and ^ are literal characters, so "end of string or
# whitespace" and "start of string or whitespace" are written as alternations.
MATCHER2 = re.compile(
    r"\b(%s)(?:$|\s)" % "|".join(PHRASES2),
    flags=re.IGNORECASE | re.UNICODE)
MATCHER3 = re.compile(
    r"(?:^|\s)(%s)\b" % "|".join(PHRASES3),
    flags=re.IGNORECASE | re.UNICODE)
MATCHER4 = re.compile(
    r"(?:^|\s)(%s)(?:$|\s)" % "|".join(PHRASES4),
    flags=re.IGNORECASE | re.UNICODE)
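The automatic sorting mentioned above could look roughly like the following sketch. It makes a few assumptions that are not in the answer: the function name `build_matcher` is hypothetical, phrases are given as plain text and escaped with `re.escape`, and the whitespace checks use the lookarounds `(?<!\S)` and `(?!\S)` so that surrounding spaces are not consumed by the match.

```python
import re


def build_matcher(phrases):
    """Pick a boundary check per phrase edge and compile one pattern."""
    parts = []
    for phrase in phrases:
        # A word-character edge can use \b. Any other edge must touch
        # whitespace or the start/end of the string.
        start = r"\b" if re.match(r"\w", phrase) else r"(?<!\S)"
        end = r"\b" if re.search(r"\w$", phrase) else r"(?!\S)"
        parts.append(start + re.escape(phrase) + end)
    return re.compile("|".join(parts), flags=re.IGNORECASE)


matcher = build_matcher(["sh*t", "f**k", "sh**", "f***", "***hole"])

for text in ('f*** this!', 'you ***hole', 'sh*t happens',
             'secret code is 123f***', 'Well f***!'):
    print(repr(text), '->', matcher.search(text))
```

As in the static version, a phrase ending in * is only found when whitespace or the end of the string follows, so 'Well f***!' is not matched.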

Answer 3

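* is not a word character, so there is no match if it is followed by \b and a non-word character.

Assuming the initial word boundary is fine, but you want to match sh*t but not sh*t*, or match f***! but not f***a, how about emulating your own word boundary with a negative lookahead?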
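A small check makes this visible. \b only matches between a word character and a non-word character, so after a trailing * it succeeds only when a word character follows (the variable names are illustrative):

```python
import re

# '*' followed by a space: both sides are non-word characters, no boundary.
after_space = re.search(r"f\*\*\*\b", "f*** you")
# '*' followed by a letter: this is a boundary, so \b matches here.
after_letter = re.search(r"f\*\*\*\b", "f***a")
# '*' at the very end of the string: no boundary either.
at_end = re.search(r"f\*\*\*\b", "f***")

print(after_space)
print(after_letter)
print(at_end)
```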

\b(...)(?![\w*])

See this demo at regex101

If needed, the opening word boundary \b can be replaced by a negative lookbehind: (?<![\w*])
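Plugged into the matcher from the question, the lookahead version might look like this sketch. The phrase tuple is copied from the question, and the helper `find` is a hypothetical convenience.

```python
import re

PHRASES = (
    r'sh\*t',
    r'sh\*\*',
    r'f\*\*k',
    r'f\*\*\*',
)

# \b guards the start; the negative lookahead acts as a custom "end
# boundary" that treats * like a word character.
MATCHER = re.compile(
    r"\b(%s)(?![\w*])" % "|".join(PHRASES),
    flags=re.IGNORECASE)


def find(text):
    """Return the matched phrase, or None when nothing matches."""
    match = MATCHER.search(text)
    return match.group(1) if match else None


for text in ('Well f*** you!', 'Well f***!', 'f***', 'f*** this!',
             'secret code is 123f***', 'f**k this!', 'sh*t*', 'f***a'):
    print(repr(text), '->', find(text))
```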

Answer 4

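I don't fully understand what you mean by * not being something that can be found next to a word boundary. However, if I understand correctly from the comments what you are looking for, I think this will work: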

\b[\w]\*+[\w]*

A word boundary
followed by some letter, e.g. f
followed by one or more *
optionally ending with some letters, e.g. k

Example:
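A quick way to try this generic pattern in Python is sketched below; the name `GENERIC` is illustrative. One thing to be aware of: [\w] matches exactly one character, so a censored word with two leading letters such as sh*t is not found by the pattern as written.

```python
import re

# One word character, one or more stars, then optional word characters.
GENERIC = re.compile(r"\b[\w]\*+[\w]*")

for text in ('Well f*** you!', 'f**k this!',
             'secret code is 123f***', 'sh*t'):
    print(repr(text), '->', GENERIC.search(text))
```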

https://regexr.com/4nqie

regex

profanity