Apify Cheerio Scraper stops even with URLs in the queue
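The scenario is as follows: I'm using Cheerio Scraper to scrape a website containing real estate announcements.
Each announcement has a link to the next one, so before scraping the current page I add the next page to the request queue.
At some random point the scraper always stops for no apparent reason, even though the next page to scrape is in the queue (I've attached an image).
Why does this happen, given that there is still a pending request in the queue?
Many thanks.
This is the message I get: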
2021-02-28T10:52:35.439Z INFO CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
2021-02-28T10:52:35.672Z INFO CheerioCrawler: Final request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":963,"requestsFinishedPerMinute":50,"requestsFailedPerMinute":0,"requestTotalDurationMillis":22143,"requestsTotal":23,"crawlerRuntimeMillis":27584,"requestsFinished":23,"requestsFailed":0,"retryHistogram":[23]}
2021-02-28T10:52:35.679Z INFO Cheerio Scraper finished.
Here is the request queue:
And here is the code:
async function pageFunction(context) {
    const { $, request, log } = context;
    // The "$" property contains the Cheerio object which is useful
    // for querying DOM elements and extracting data from them.
    const pageTitle = $('title').first().text();
    // The "request" property contains various information about the web page loaded.
    const url = request.url;
    // Use "log" object to print information to actor log.
    log.info('Scraping Page', { url, pageTitle });
    // Adding next page to the queue
    var baseUrl = '...';
    if ($('div.d3-detailpager__element--next a').length > 0) {
        var nextPageUrl = $('div.d3-detailpager__element--next a').attr('href');
        log.info('Found another page', { nextUrl: baseUrl.concat(nextPageUrl) });
        context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });
    }
    // My code for scraping follows here
    return { /* my scraped object */ };
}
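You are missing an await. context.enqueueRequest is asynchronous, so without await the pageFunction can return before the next page has actually been added to the queue; the crawler may then find the queue empty and shut down. Change the call to:
await context.enqueueRequest({ url: baseUrl.concat(nextPageUrl) });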
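For reference, here is a minimal sketch of the pageFunction from the question with the await added. It assumes the same selectors and the same elided baseUrl as the question; the returned object is still a placeholder, and the small cleanups (const, a single nextLink lookup) are only illustrative, not something the fix requires.

```javascript
// Sketch of the question's pageFunction with the enqueue call awaited.
async function pageFunction(context) {
    const { $, request, log } = context;
    const pageTitle = $('title').first().text();
    const url = request.url;
    log.info('Scraping Page', { url, pageTitle });

    const baseUrl = '...'; // elided in the original question
    const nextLink = $('div.d3-detailpager__element--next a');
    if (nextLink.length > 0) {
        const nextUrl = baseUrl.concat(nextLink.attr('href'));
        log.info('Found another page', { nextUrl });
        // Wait until the request is stored in the queue before returning,
        // so the crawler does not see an empty queue and shut down.
        await context.enqueueRequest({ url: nextUrl });
    }

    // Scraping code for the current page would follow here.
    return { /* scraped object */ };
}
```

With the await in place, the function only resolves after the enqueue call has completed, so the next request should already be pending when the crawler checks whether the queue is finished.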