Crawler4j, Some urls are crawled without issue while others are not crawled at all
I've been playing around with Crawler4j and have successfully had it crawl some pages, but with others it fails entirely. For example, I have successfully crawled Reddit with this code:
public class Controller {
public static void main(String[] args) throws Exception {
String crawlStorageFolder = "//home/user/Documents/Misc/Crawler/test";
int numberOfCrawlers = 1;
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
/*
* Instantiate the controller for this crawl.
*/
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
/*
* For each crawl, you need to add some seed urls. These are the first
* URLs that are fetched and then the crawler starts following links
* which are found in these pages
*/
controller.addSeed("https://www.reddit.com/r/movies");
controller.addSeed("https://www.reddit.com/r/politics");
/*
* Start the crawl. This is a blocking operation, meaning that your code
* will reach the line after this only when crawling is finished.
*/
controller.start(MyCrawler.class, numberOfCrawlers);
}
}
And:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches()
&& href.startsWith("https://www.reddit.com/");
}
in MyCrawler.java. However, when I try to crawl http://www.ratemyprofessors.com/, the program just hangs with no output and crawls nothing. I use the following code as above, in myController.java:
controller.addSeed("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
controller.addSeed("http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044");
And in MyCrawler.java:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches()
&& href.startsWith("http://www.ratemyprofessors.com/");
}
So I'm wondering:
- Can some servers recognize crawlers right away and refuse to let them collect data?
- I noticed that the RateMyProfessors pages are .jsp; could that have anything to do with it?
- Is there any way I can debug this better? The console outputs nothing.
crawler4j honors crawler politeness, e.g. the robots.txt. In your case this file is http://www.ratemyprofessors.com/robots.txt.
Inspecting this file reveals that crawling your given seed points is disallowed:
Disallow: /ShowRatings.jsp
Disallow: /campusRatings.jsp
The crawler4j log output supports this theory:
2015-12-15 19:47:18,791 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222
2015-12-15 19:47:18,793 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044
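If you need to fetch these pages anyway (only with the site owner's permission — ignoring robots.txt may violate the site's terms of service), crawler4j lets you switch off robots.txt handling through `RobotstxtConfig.setEnabled(false)`. A minimal sketch, assuming a recent crawler4j version:

```java
// Sketch: disable robots.txt handling so disallowed seeds are not rejected.
// Use responsibly — this bypasses the site's stated crawling policy.
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false); // crawler will no longer consult robots.txt

RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
```

With this in place the two `Robots.txt does not allow this seed` warnings should no longer appear, since the check is skipped entirely.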
I had a similar issue, and the error message I got was:
2017-01-18 14:18:21,136 WARN [Crawler 1] e.u.i.c.c.WebCrawler [:412] Unhandled exception while fetching http://people.com/: people.com:80 failed to respond
2017-01-18 14:18:21,140 INFO [Crawler 1] e.u.i.c.c.WebCrawler [:357] Stacktrace:
org.apache.http.NoHttpResponseException: people.com:80 failed to respond
But I'm sure people.com responds to a browser.
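One common cause of this symptom (not confirmed for people.com specifically) is that the server silently drops connections from crawler4j's default user agent, which surfaces client-side as `NoHttpResponseException`. A sketch of presenting a browser-like user agent instead, via `CrawlConfig.setUserAgentString` (the storage path is a placeholder):

```java
// Sketch: override the default crawler4j user agent string.
// Some servers close the connection for unrecognized agents, which the
// Apache HttpClient layer reports as org.apache.http.NoHttpResponseException.
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler");  // hypothetical storage folder
config.setUserAgentString("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36");
```

If the site still fails to respond with a browser-like agent, the block is likely happening at another layer (IP reputation, rate limiting), and no client-side setting will help.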