Crawler4j, Some urls are crawled without issue while others are not crawled at all
I've been playing around with Crawler4j and have successfully had it crawl some pages, but with others it fails entirely. For example, I have successfully crawled Reddit with this code:
public class Controller {
public static void main(String[] args) throws Exception {
String crawlStorageFolder = "//home/user/Documents/Misc/Crawler/test";
int numberOfCrawlers = 1;
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
/*
* Instantiate the controller for this crawl.
*/
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
/*
* For each crawl, you need to add some seed urls. These are the first
* URLs that are fetched and then the crawler starts following links
* which are found in these pages
*/
controller.addSeed("https://www.reddit.com/r/movies");
controller.addSeed("https://www.reddit.com/r/politics");
/*
* Start the crawl. This is a blocking operation, meaning that your code
* will reach the line after this only when crawling is finished.
*/
controller.start(MyCrawler.class, numberOfCrawlers);
}
}
And:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches()
&& href.startsWith("https://www.reddit.com/");
}
in MyCrawler.java. However, when I try to crawl http://www.ratemyprofessors.com/, the program just hangs with no output and crawls nothing. I use the following code as above, in myController.java:
controller.addSeed("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
controller.addSeed("http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044");
And in MyCrawler.java:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches()
&& href.startsWith("http://www.ratemyprofessors.com/");
}
So I'm wondering:
- Can some servers recognize crawlers right away and refuse to let them collect data?
- I noticed that the RateMyProfessors pages are .jsp; could that have anything to do with it?
- Is there any way I can debug this better? The console outputs nothing.
crawler4j honors crawler politeness, e.g. the robots.txt. In your case this file is http://www.ratemyprofessors.com/robots.txt.
Inspecting this file reveals that crawling your given seed points is disallowed:
Disallow: /ShowRatings.jsp
Disallow: /campusRatings.jsp
The crawler4j log output supports this theory:
2015-12-15 19:47:18,791 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222
2015-12-15 19:47:18,793 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044
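If you need to fetch these pages anyway (only with the site owner's permission — ignoring robots.txt may violate the site's terms of service), crawler4j lets you switch off robots.txt handling through `RobotstxtConfig.setEnabled(false)`. A minimal sketch, assuming a recent crawler4j version:

```java
// Sketch: disable robots.txt handling so disallowed seeds are not rejected.
// Use responsibly — this bypasses the site's stated crawling policy.
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false); // crawler will no longer consult robots.txt

RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
```

With this in place the two `Robots.txt does not allow this seed` warnings should no longer appear, since the check is skipped entirely.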
I had a similar issue, and the error message I got was:
2017-01-18 14:18:21,136 WARN [Crawler 1] e.u.i.c.c.WebCrawler [:412] Unhandled exception while fetching http://people.com/: people.com:80 failed to respond
2017-01-18 14:18:21,140 INFO [Crawler 1] e.u.i.c.c.WebCrawler [:357] Stacktrace:
org.apache.http.NoHttpResponseException: people.com:80 failed to respond
But I'm sure people.com responds to a browser.
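One common cause of this symptom (not confirmed for people.com specifically) is that the server silently drops connections from crawler4j's default user agent, which surfaces client-side as `NoHttpResponseException`. A sketch of presenting a browser-like user agent instead, via `CrawlConfig.setUserAgentString` (the storage path is a placeholder):

```java
// Sketch: override the default crawler4j user agent string.
// Some servers close the connection for unrecognized agents, which the
// Apache HttpClient layer reports as org.apache.http.NoHttpResponseException.
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler");  // hypothetical storage folder
config.setUserAgentString("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36");
```

If the site still fails to respond with a browser-like agent, the block is likely happening at another layer (IP reputation, rate limiting), and no client-side setting will help.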