regex-urlfilter syntax with Apache Nutch

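I want to filter URLs of the form https://www.abcd.com/def/*, meaning that anything after def/ is fine as long as the domain is www.abcd.com and the /def/ path segment is present. After spending a lot of time on this, I still cannot figure out how to write the correct regular expression.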

This might work:

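# Each rule must start with + or - in the first column; the first rule that matches a URL decides,
# so the exclusions come before the accept rule.
# skip URLs under /def/ containing certain characters as probable queries, etc.
-^https://www\.abcd\.com/def/.*[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-^https://www\.abcd\.com/def/.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else under https://www.abcd.com/def/
+^https://www\.abcd\.com/def/
# "accept everything else" stays commented out, so a URL that matches no rule is rejected
#+.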
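
To sanity-check filter rules like these without running a crawl, the first-match-wins evaluation can be approximated in a few lines of Python. This is only a sketch. The `RULES` list and the `accepts` helper are hypothetical names, not part of Nutch. The rule set is one assumed ordering: exclusions before the accept rule, dots escaped, and `\1` backreferences in the loop breaker. Python's `re` stands in for the Java regex engine that Nutch uses, which should behave the same for patterns this simple.

```python
import re

# Assumed rule set, in file order: (sign, pattern). Hypothetical, for illustration only.
RULES = [
    # reject probable query URLs under /def/
    ("-", r"^https://www\.abcd\.com/def/.*[?*!@=]"),
    # reject paths where a segment repeats 3+ times (crawler loops)
    ("-", r"^https://www\.abcd\.com/def/.*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    # accept everything else under /def/
    ("+", r"^https://www\.abcd\.com/def/"),
]


def accepts(url, rules=RULES):
    """Emulate first-match-wins: the first rule whose pattern is found decides.

    A URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):  # unanchored search; patterns anchor themselves with ^
            return sign == "+"
    return False


if __name__ == "__main__":
    samples = [
        "https://www.abcd.com/def/page.html",
        "https://www.abcd.com/def/list?page=2",
        "https://www.abcd.com/def/x/a/b/a/b/a/",
        "https://www.abcd.com/other/page.html",
    ]
    for url in samples:
        print("accept" if accepts(url) else "reject", url)
```

Passing the same rules with the `+` rule moved to the top, for example `accepts(url, [RULES[2]] + RULES[:2])`, makes the query URL come back as accepted, which illustrates why the order of lines in the filter file matters.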