Nutch 1.4 with Solr 3.4 - can't crawl URL, "no URLs to fetch"

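I followed a tutorial on web crawling with Nutch, using Cygwin, Tomcat, Nutch 1.4 and Solr 3.4. I was able to crawl a URL once, but somehow this no longer works, no matter which URL I try. My regex-urlfilter.txt in runtime/local/conf is as follows: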

# skip file: ftp: and mailto: urls

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin

# skip URLs containing certain characters as probable queries, etc.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops

# accept anything else



$ ./nutch crawl urls -dir newCrawl3 -solr http://localhost:8080/solr/ -depth 2 -topN 3


cygpath: can't convert empty path
crawl started in: newCrawl3
rootUrlDir = urls
threads = 10
depth = 2
topN = 3
Injector: starting at 2017-05-18 17:03:25
Injector: crawlDb: newCrawl3/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2017-05-18 17:03:28, elapsed: 00:00:02
Generator: starting at 2017-05-18 17:03:28
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl3



Why are you using such a very old version of Nutch? Nevertheless, the problem you are facing is the space at the beginning of this line:


(I have highlighted the space with an underscore.) Every line that starts with a space, \n or # is ignored by the configuration parser, take a look:
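To illustrate the effect, here is a minimal Python sketch of a rule reader that skips lines in the way described above. It is not Nutch's actual source: the function name `parse_rules` is hypothetical, and the `+.` rule is only assumed to be the one under "# accept anything else", since the rule lines themselves are not shown in the question.

```python
def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, pattern) pairs.

    Illustrative only: mimics the skipping behaviour described in the
    answer (lines starting with a space or '#', and empty lines, are
    ignored). Not copied from Nutch.
    """
    rules = []
    for line in text.splitlines():
        if not line:
            continue  # empty line: nothing to parse
        first = line[0]
        if first in (" ", "#"):
            continue  # comment, or a rule silently disabled by a leading space
        if first == "+":
            rules.append((True, line[1:]))   # accept rule
        elif first == "-":
            rules.append((False, line[1:]))  # reject rule
        else:
            raise ValueError("invalid first character: " + repr(line))
    return rules


# Assumed rule '+.' with and without the stray leading space.
broken = "# accept anything else\n +.\n"
fixed = "# accept anything else\n+.\n"

print(parse_rules(broken))  # []
print(parse_rules(fixed))   # [(True, '.')]
```

With the leading space, the accept rule is dropped without any error, so no URL matches an accept rule and the Generator selects 0 records, which would be consistent with the log in the question.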

You can try deleting the directory newCrawl3. Nutch will not crawl a URL again if it was crawled recently.