Understanding why negative lookahead is not working

Question

Suppose I have this URL:

https://www.google.com/search?q=test&tbm=isch&randomParameters=123

I want to match Google search URLs, but only those that do not contain any of:

tbm=isch

tbm=news

param1=432

I tried this pattern:

^http(s):\/\/www.google.(.*)\/(search|webhp)\?(?![\s]+(tbm=isch|tbm=news|param1=432))

But it does not work: it still matches the example URL.

Answer 1

Your regex should be

^https:\/\/www.google.([^\/]*)\/(search|webhp)\?(?!.*(tbm\=isch|tbm\=news|param1\=432)).*$

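The problem is that your lookahead uses [\s]+ instead of .*. [\s]+ only matches whitespace, and the character right after the ? is not whitespace, so the sub-pattern inside the lookahead can never match and the negative lookahead always succeeds. .* matches any number of characters, which lets the lookahead scan the rest of the query string for the unwanted parameters.

Also, www.google.(.*) causes a lot of backtracking, which hurts performance, so I replaced it with www.google.([^\/]*)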
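
To see the difference concretely, here is a minimal sketch in Python (my choice of language; the question does not name one) that runs the pattern from the question and the pattern from this answer against the example URL. The variable names are only for illustration.

```python
import re

url = "https://www.google.com/search?q=test&tbm=isch&randomParameters=123"

# Pattern from the question: the lookahead demands whitespace right after "?",
# so its inner pattern never matches and the negative lookahead always passes.
original = re.compile(
    r"^http(s):\/\/www.google.(.*)\/(search|webhp)\?"
    r"(?![\s]+(tbm=isch|tbm=news|param1=432))"
)

# Pattern from this answer: ".*" lets the lookahead scan the whole query string.
corrected = re.compile(
    r"^https:\/\/www.google.([^\/]*)\/(search|webhp)\?"
    r"(?!.*(tbm\=isch|tbm\=news|param1\=432)).*$"
)

original_matches = original.search(url) is not None    # still matches
corrected_matches = corrected.search(url) is not None  # rejected
plain_matches = corrected.search("https://www.google.com/search?q=test") is not None

print(original_matches, corrected_matches, plain_matches)
```

The first pattern accepts the URL even though it contains tbm=isch, while the second rejects it and still accepts a search URL without any of the unwanted parameters.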

Edit

I wonder why you are using a regex for this rather than a simple indexOf or the equivalent method in your language. Is there a special use case here?
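
For comparison, a plain substring check could look like the sketch below, assuming Python (where the in operator plays the role of indexOf). The names BLOCKED and has_blocked_param are made up for this example. Note that a bare substring test is looser than a real parameter check: it would also flag a parameter such as xtbm=isch.

```python
# Hypothetical helper: the three strings come from the question.
BLOCKED = ("tbm=isch", "tbm=news", "param1=432")


def has_blocked_param(url: str) -> bool:
    """Return True if any of the unwanted substrings occurs anywhere in the URL."""
    return any(token in url for token in BLOCKED)


print(has_blocked_param("https://www.google.com/search?q=test&tbm=isch&randomParameters=123"))
print(has_blocked_param("https://www.google.com/search?q=test"))
```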

Answer 2

You could use:

^                         # anchor it to the beginning
https?://                 # http or https
(?:
    (?!tbm=(?:isch|news)) # first neg. lookahead
    (?!param1=432)        # second
    \S                    # anything but whitespace
)+
$                         # THE END

See a demo on regex101.com.
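
The commented pattern above can be used as written in an engine that supports free-spacing mode. A minimal sketch in Python, assuming the re.VERBOSE flag, follows. Be aware that this pattern only checks the scheme and the unwanted parameters, so it accepts any http or https URL, not just Google search URLs.

```python
import re

pattern = re.compile(
    r"""
    ^                         # anchor it to the beginning
    https?://                 # http or https
    (?:
        (?!tbm=(?:isch|news)) # first neg. lookahead
        (?!param1=432)        # second
        \S                    # anything but whitespace
    )+
    $                         # THE END
    """,
    re.VERBOSE,
)

rejected = pattern.match(
    "https://www.google.com/search?q=test&tbm=isch&randomParameters=123"
)
accepted = pattern.match("https://www.google.com/search?q=test")

print(rejected is None, accepted is not None)
```

Because the two lookaheads are checked before every single character, the match stops as soon as one of the unwanted parameters begins, and the trailing $ then makes the whole match fail.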
For your particular programming language, though, there may be built-in methods such as urlparse().
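
As a rough sketch of that idea, assuming Python and its standard urllib.parse module, the check can be done on parsed components instead of on the raw string. The function name is_plain_google_search and the two constants are hypothetical, and the host test (a www.google. prefix) is only an approximation of what the regexes in this thread accept. Unlike the regexes, this version also accepts a URL with no query string at all.

```python
from urllib.parse import urlparse, parse_qsl

# Hypothetical constants built from the values in the question.
BLOCKED_PARAMS = {("tbm", "isch"), ("tbm", "news"), ("param1", "432")}
ALLOWED_PATHS = {"/search", "/webhp"}


def is_plain_google_search(url: str) -> bool:
    """Return True for a Google search/webhp URL without the unwanted parameters."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""
    if not host.startswith("www.google."):
        return False
    if parts.path not in ALLOWED_PATHS:
        return False
    # parse_qsl returns a list of (name, value) pairs from the query string.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return not any(pair in BLOCKED_PARAMS for pair in params)


print(is_plain_google_search("https://www.google.com/search?q=test&tbm=isch&randomParameters=123"))
print(is_plain_google_search("https://www.google.com/search?q=test"))
```

Comparing whole (name, value) pairs avoids the false positives that a substring or regex test can produce for parameters whose names merely end in tbm or param1.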

Answer 3

Change [\s]+ to .*? or [\S]*? and your regex will work. To also match the whole URL when it qualifies, append another [\S]* at the end:

^http(s):\/\/www.google.([\w\.]*)\/(search|webhp)\?(?![\S]*?(tbm=isch|tbm=news|param1=432))[\S]*
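
A quick way to check this pattern is the following sketch, assuming Python's re module. One detail worth noticing: http(s) is kept from the question and makes the s mandatory, so plain http:// URLs are not matched; https? would accept both.

```python
import re

pattern = re.compile(
    r"^http(s):\/\/www.google.([\w\.]*)\/(search|webhp)\?"
    r"(?![\S]*?(tbm=isch|tbm=news|param1=432))[\S]*"
)

with_isch = pattern.match(
    "https://www.google.com/search?q=test&tbm=isch&randomParameters=123"
)
without_blocked = pattern.match("https://www.google.com/search?q=test")
plain_http = pattern.match("http://www.google.com/search?q=test")

print(with_isch is None, without_blocked is not None, plain_http is None)
```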

regex

regex-lookarounds