regex-urlfilter syntax with Apache Nutch
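I want to filter URLs of the form https://www.abcd.com/def/*, meaning anything after def/ is acceptable as long as the domain is www.abcd.com and the /def/ path prefix is present. After spending a lot of time on this, I still can't figure out how to write the correct regular expression.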
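This might work. Nutch applies the first pattern in the file that matches a URL, so the reject rules have to come before the accept rule:
# skip URLs under /def/ containing certain characters as probable queries, etc.
-^https://www\.abcd\.com/def/.*[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-^https://www\.abcd\.com/def.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything under https://www.abcd.com/def/
+^https://www\.abcd\.com/def/
# keep the default "accept everything else" rule commented out so all other URLs are rejected
#+.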
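To check a rule list before running a crawl, the first-match-wins behaviour can be imitated offline. The following is only a rough sketch, not Nutch code: the `RULES` list and the `accepts` helper are hypothetical names, the rules are a variant of the ones above with the reject rules first and the literal dots escaped, and it assumes Python's `re.search` behaves like Nutch's matching for these simple patterns.

```python
import re

# Rules in file order as (sign, pattern) pairs. Hypothetical variant of the
# rules above: reject rules first, literal dots escaped.
RULES = [
    ("-", r"^https://www\.abcd\.com/def/.*[?*!@=]"),
    ("-", r"^https://www\.abcd\.com/def.*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^https://www\.abcd\.com/def/"),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is '+'.

    Returns False if the first matching rule is '-' or if no rule matches.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    samples = [
        "https://www.abcd.com/def/page.html",
        "https://www.abcd.com/def/search?q=1",
        "https://www.abcd.com/other/page.html",
        "https://www.abcd.com/def/a/b/a/c/a/",
    ]
    for url in samples:
        print("accept" if accepts(url) else "reject", url)
```

Swapping the order so the `+` rule comes first makes the query URL pass, which is why the ordering matters.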