How to crawl magnet links with Apache Nutch and Solr so that they're available in Solr query results?

Question

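I've become familiar with crawling with Apache Nutch and Solr, but realized that while HTTP and HTTPS links show up in the content field of Solr query results, magnet links do not. I adjusted conf/regex-urlfilter.txt to: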

-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# for linuxtracker.org
+^https?://*linuxtracker.org/(.+)*$
#+^magnet:\?xt=(.+)*$
    # causes magnet links to be ignored/not appear in content field
+^magnet:*$

# reject anything else
-.

and I don't understand why the magnet links are not included in content. As you can see, I'm working with http://linuxtracker.org, which has, for example, the magnet link magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P on http://linuxtracker.org/?page=torrent-details&id=24c76d5e7f3a758f0798e9b5895cc2e9ac9797cf.

I'm investigating this by checking whether magnet links appear when querying Solr with pysolr after crawling with bin/crawl, as follows:

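import pysolr

# solr_core_url must hold the URL of the Solr core that Nutch indexes into
solr = pysolr.Solr(solr_core_url, timeout=10)
results = solr.search('*:*')  # match-all query
for result in results:
    print(result)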
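
To look specifically for magnet links instead of reading every printed document, the results can be filtered on the client side. This is only a minimal sketch, assuming each result is a dict-like document whose content field holds a string or a list of strings; docs_with_magnet_links is a hypothetical helper, not part of pysolr, and the sample documents are made up for illustration:

```python
def docs_with_magnet_links(docs, field="content"):
    """Return the documents whose `field` contains the text 'magnet:'."""
    hits = []
    for doc in docs:
        value = doc.get(field, "")
        # A field may come back as a single value or a list; normalise to a list.
        values = value if isinstance(value, list) else [value]
        if any("magnet:" in str(v) for v in values):
            hits.append(doc)
    return hits


# Stand-in documents; with a live core, pass the result of solr.search('*:*').
sample = [
    {"id": "a", "content": "see magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P"},
    {"id": "b", "content": ["no links here"]},
]
print([d["id"] for d in docs_with_magnet_links(sample)])
```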

I'm using Apache Nutch release-1.13-73-g9446b1e1 and Solr 6.6.1 on Ubuntu 17.04.

Answer 1

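Short answer: magnet links are not "normal" links, and Nutch doesn't support them out of the box.

Long answer:

The configuration you changed is applied after the links are extracted. If you're using the parse-html parser plugin, it first tries to evaluate whether a possible outlink is a valid link, which basically amounts to creating a java.net.URL.

java.net.URL, on the other hand, doesn't support magnet links out of the box. According to the javadocs: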

Protocol handlers for the following protocols are guaranteed to exist on the search path :
 http, https, ftp, file, and jar
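
The effect of that check can be imitated outside the JVM. The following Python sketch is not Nutch code; it only mimics the idea that a link is accepted when its scheme has a known handler, assuming the five protocols quoted above are the only ones available. The name has_protocol_handler is hypothetical:

```python
from urllib.parse import urlsplit

# Protocols named in the javadoc quote; assumed here to be the only ones.
GUARANTEED_PROTOCOLS = {"http", "https", "ftp", "file", "jar"}


def has_protocol_handler(url):
    """Mimic the check: does the URL's scheme have a known handler?"""
    return urlsplit(url).scheme.lower() in GUARANTEED_PROTOCOLS


for candidate in (
    "http://linuxtracker.org/",
    "magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P",
):
    print(candidate, has_protocol_handler(candidate))
```

In Java itself, constructing a java.net.URL for a protocol without a registered handler throws a MalformedURLException, so a magnet link fails this step before any URL filter rule gets to see it.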

If you're using parse-tika, something similar happens.

One option would be to write a custom parser that handles this for you. Keep in mind that, in any case, you wouldn't want to follow the magnet links (that is, have them as outlinks), because Nutch would not be able to process them.

If you only want to index the links in Solr/ES (for search purposes), you can write your own HtmlParseFilter and add those links to a separate field.
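
A real HtmlParseFilter would be a Java class packaged as a Nutch plugin, which is not shown here. The core of the idea, though, is just collecting magnet: hrefs from the page and keeping them under their own field. The following Python sketch illustrates only that extraction step using the standard library; the field name magnet_links, the class name, and the sample HTML are assumptions made for illustration:

```python
from html.parser import HTMLParser


class MagnetLinkCollector(HTMLParser):
    """Collect href values of <a> tags that use the magnet: scheme."""

    def __init__(self):
        super().__init__()
        self.magnet_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("magnet:"):
                self.magnet_links.append(value)


def extract_magnet_field(html):
    """Return a dict with the magnet links found, keyed by a separate field name."""
    collector = MagnetLinkCollector()
    collector.feed(html)
    collector.close()
    return {"magnet_links": collector.magnet_links}


page = (
    '<a href="http://linuxtracker.org/">home</a>'
    '<a href="magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P">magnet</a>'
)
print(extract_magnet_field(page))
```

Because the links end up in a dedicated field rather than in the outlinks, Nutch never tries to fetch them, which matches the caveat above about not following magnet links.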


solr

web-crawler

nutch

magnet-uri