Apify - How to Include Failed Results in Dataset
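We are using the Apify Web Scraper actor to create a URL validation task that returns the input URL, the page title, and the HTTP response status code. We are testing with a set of 5 URLs: 4 valid and 1 non-existent. Successful results are always included in the dataset, but the failed URL never is.
The logs show that, for the failed URL, the pageFunction is never even reached: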
2021-05-05T14:50:08.489Z ERROR PuppeteerCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"http://www.invalidurl.com","retryCount":1,"id":"XS9JTk8dYRM8bpM"}
2021-05-05T14:50:08.490Z Error: gotoFunction timed out after 30 seconds.
2021-05-05T14:50:08.490Z at PuppeteerCrawler._handleRequestTimeout (/home/myuser/node_modules/apify/build/crawlers/puppeteer_crawler.js:387:15)
2021-05-05T14:50:08.496Z at PuppeteerCrawler._handleRequestFunction (/home/myuser/node_modules/apify/build/crawlers/puppeteer_crawler.js:329:26)
The request eventually times out for good, per our settings:
2021-05-05T14:50:42.052Z ERROR Request http://www.invalidurl.com failed and will not be retried anymore. Marking as failed.
2021-05-05T14:50:42.052Z Last Error Message: Error: gotoFunction timed out after 30 seconds.
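I tried wrapping the code in the pageFunction in a try/catch block, but since the invalid URL never reaches the pageFunction, this did not solve the problem. Is there a way to still include failed results in the dataset, with a hard-coded response status code of "000"? (See the pageFunction code below.) Please let me know if I can provide any other information, and thanks in advance!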
async function pageFunction(context) {
    context.log.info("Starting pageFunction");
    // use jQuery as $
    const { request, jQuery: $ } = context;
    const { url } = request;
    context.log.info("Trying " + url);
    let title = null;
    let responseCode = null;
    try {
        context.log.info("In try block for " + url);
        title = $('title').first().text().trim();
        responseCode = context.response.status;
    } catch (error) {
        context.log.info("EXCEPTION for " + url);
        title = "";
        responseCode = "000";
    }
    return {
        url,
        title,
        responseCode
    };
}
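You can use handleFailedRequestFunction: https://sdk.apify.com/docs/typedefs/puppeteer-crawler-options#handlefailedrequestfunction
It is called once all retries for a request have failed, so you can push the failed result to the dataset from there: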
handleFailedRequestFunction: async ({ request }) => {
    // failed all retries
    await Apify.pushData({ url: request.url, responseCode: '000' });
}
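To keep failed rows in the same shape as the rows returned by the pageFunction (url, title, responseCode), the record could be built by a small helper. The following is only a sketch, not the SDK's own API: buildFailedRecord is a hypothetical name, the extra error field is an optional addition, and it assumes the request object carries an errorMessages array of strings. It also assumes you run your own PuppeteerCrawler through the Apify SDK, where handleFailedRequestFunction can be set as a crawler option.

```javascript
// Hypothetical helper (not part of the Apify SDK): builds a dataset row for a
// request that failed all retries, using the same keys the pageFunction returns.
function buildFailedRecord(request) {
    // Assumption: request.errorMessages is an array of strings; fall back to
    // an empty array if it is missing.
    const errors = Array.isArray(request.errorMessages) ? request.errorMessages : [];
    return {
        url: request.url,
        title: '',
        responseCode: '000', // hard-coded marker for "no HTTP response"
        // Optional extra field: the last recorded error message, or null.
        error: errors.length > 0 ? errors[errors.length - 1] : null,
    };
}

// Intended wiring, assuming an SDK-based PuppeteerCrawler:
// handleFailedRequestFunction: async ({ request }) => {
//     await Apify.pushData(buildFailedRecord(request));
// },

// Standalone example using the URL and error text from the logs above.
const example = buildFailedRecord({
    url: 'http://www.invalidurl.com',
    errorMessages: ['Error: gotoFunction timed out after 30 seconds.'],
});
console.log(example);
```

Keeping the title and responseCode keys on failed rows means every dataset item has the same columns, which makes the exported results easier to filter by responseCode.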