Apify - How to Enqueue URL Variations Efficiently
I'm creating a new actor in Apify that uses Cheerio to read an input file of URLs and return, primarily, two items: (1) the HTTP status code and (2) the HTML title. As part of our process, I would like to be able to try up to 4 variations of each input URL, such as:
- HTTP://WWW.SOMEURL.COM
- HTTPS://WWW.SOMEURL.COM
- HTTP://SOMEURL.COM
- HTTPS://SOMEURL.COM
If one of the 4 variations succeeds, the process should ignore the other variations and move on to the next input URL.
I read the original input list into a RequestList and then want to create the variations in a RequestQueue. Is this the most efficient way to do it? Please see the code below, thanks!
const Apify = require('apify');
const {
    utils: { enqueueLinks },
} = Apify;
const urlParse = require('url');

Apify.main(async () => {
    const input = await Apify.getInput();
    const inputFile = input.inputFile;
    console.log('INPUT FILE: ' + inputFile);

    const requestList = await Apify.openRequestList('urls', [
        { requestsFromUrl: inputFile, userData: { isFromUrl: true } },
    ]);
    const requestQueue = await Apify.openRequestQueue();
    const proxyConfiguration = await Apify.createProxyConfiguration();

    const handlePageFunction = async ({ $, request, response }) => {
        let parsedHost = urlParse.parse(request.url).host;
        let simplifiedHost = parsedHost.replace('www.', '');

        const urlPrefixes = ['HTTP://WWW.', 'HTTPS://WWW.', 'HTTP://', 'HTTPS://'];
        let i;
        for (i = 0; i < urlPrefixes.length; i++) {
            let newUrl = urlPrefixes[i] + simplifiedHost;
            console.log('NEW URL: ' + newUrl);
            await requestQueue.addRequest({ url: newUrl });
        }

        console.log(`Processing ${request.url}`);
        const results = {
            inputUrl: request.url,
            httpCode: response.statusCode,
            title: $('title').first().text().trim(),
            responseUrl: response.url
        };
        await Apify.pushData(results);
    };

    const crawler = new Apify.CheerioCrawler({
        proxyConfiguration,
        maxRequestRetries: 0,
        handlePageTimeoutSecs: 60,
        requestTimeoutSecs: 60,
        requestList,
        requestQueue,
        handlePageFunction,
        handleFailedRequestFunction: async ({ request }) => {
            await Apify.pushData({ inputUrl: request.url, httpCode: '000', title: '', responseUrl: '' });
        }
    });

    await crawler.run();
});
You should create the list of URLs beforehand. The handlePageFunction should be used only for the actual scraping; the only thing it needs to do is call Apify.pushData:
//...
const initRequestList = await Apify.openRequestList('urls', [
    { requestsFromUrl: inputFile },
]);

// Drain the initial list and build the 4 variations for each host.
const parsedRequests = [];
let req;
while (req = await initRequestList.fetchNextRequest()) {
    const parsedHost = urlParse.parse(req.url).host;
    const simplifiedHost = parsedHost.replace('www.', '');

    const urlPrefixes = ['HTTP://WWW.', 'HTTPS://WWW.', 'HTTP://', 'HTTPS://'];
    for (let i = 0; i < urlPrefixes.length; i++) {
        let newUrl = urlPrefixes[i] + simplifiedHost;
        console.log('NEW URL: ' + newUrl);
        parsedRequests.push({
            url: newUrl,
            userData: { isFromUrl: true }
        });
    }
}

// Feed the expanded list of variations straight into the crawler.
const requestList = await Apify.openRequestList('starturls', parsedRequests);
//...
const crawler = new Apify.CheerioCrawler({
    proxyConfiguration,
    maxRequestRetries: 0,
    handlePageTimeoutSecs: 60,
    requestTimeoutSecs: 60,
    handlePageFunction,
    requestList,
    handleFailedRequestFunction: async ({ request }) => {
        await Apify.pushData({ inputUrl: request.url, httpCode: '000', title: '', responseUrl: '' });
    }
});
//...
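For completeness, the handlePageFunction referenced in the crawler options above can then shrink to the scraping part only. A minimal sketch, reusing the fields from the question's code:

const handlePageFunction = async ({ $, request, response }) => {
    console.log(`Processing ${request.url}`);
    // No enqueueing here any more; just record status code and title.
    await Apify.pushData({
        inputUrl: request.url,
        httpCode: response.statusCode,
        title: $('title').first().text().trim(),
        responseUrl: response.url
    });
};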
requestsFromUrl is a greedy function: it tries to resolve every URL from the given resource, so you have to do this kind of processing as an additional step.
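The answer does not address the question's "stop after the first successful variation" requirement. One possible approach, shown only as an assumption and not as part of the answer, is to store the simplified host in each request's userData while building parsedRequests (e.g. userData: { simplifiedHost }) and keep an in-memory Set of hosts that have already produced a page, so the output of later variations of the same host can be skipped:

// Hypothetical sketch: assumes each request in parsedRequests was
// pushed with userData: { simplifiedHost } in addition to the URL.
const succeededHosts = new Set(); // hosts that already returned a page

const handlePageFunction = async ({ $, request, response }) => {
    const host = request.userData.simplifiedHost;
    if (succeededHosts.has(host)) return; // another variation already succeeded, skip its output
    succeededHosts.add(host);
    await Apify.pushData({
        inputUrl: request.url,
        httpCode: response.statusCode,
        title: $('title').first().text().trim(),
        responseUrl: response.url
    });
};

Note that this only tracks successes in memory for a single run, and because the crawler processes requests concurrently, a duplicate result may still occasionally slip through before the Set is updated.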