Failure to use Netnut.io proxy with Apify Cheerio scraper
I'm developing a web scraper and I want to integrate Netnut's proxy into it.
The Netnut integration details I was given:
Proxy URL: gw.ntnt.io
Proxy Port: 5959
Proxy User: igorsavinkin-cc-any
Proxy Password: xxxxx
Example Rotating IP format (IP:PORT:USERNAME-CC-COUNTRY:PASSWORD):
gw.ntnt.io:5959:igorsavinkin-cc-any:xxxxx
In order to change the country, please change 'any' to your desired
country. (US, UK, IT, DE etc.) Available countries:
https://l.netnut.io/countries
Our IPs are automatically rotated, if you wish to make them Static
Residential, please add a session ID in the username parameter like
the example below:
Username-cc-any-sid-any_number
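From those notes, both the country and the optional session ID are encoded in the proxy username. A quick sketch of my reading of that format (the credentials are the placeholders from the docs above):

const user = 'igorsavinkin';

// Rotating IPs, any country:
const rotatingAny = `${user}-cc-any`;        // igorsavinkin-cc-any

// Rotating IPs pinned to Germany:
const rotatingDE = `${user}-cc-DE`;          // igorsavinkin-cc-DE

// Static residential (sticky) session; the session ID is any number:
const sticky = `${user}-cc-any-sid-12345`;   // igorsavinkin-cc-any-sid-12345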
Code:
const Apify = require('apify');
const { log } = Apify.utils;

Apify.main(async () => {
    const proxyConfiguration = await Apify.createProxyConfiguration({
        proxyUrls: [
            'gw.ntnt.io:5959:igorsavinkin-DE:xxxxx'
        ]
    });
    // Add URLs to a RequestQueue.
    const requestQueue = await Apify.openRequestQueue(queue_name);
    await requestQueue.addRequest({ url: 'https://ip.nf/me.txt' });
    // Create an instance of the CheerioCrawler class - a crawler
    // that automatically loads the URLs and parses their HTML using the cheerio library.
    const crawler = new Apify.CheerioCrawler({
        // Let the crawler fetch URLs from our queue.
        requestQueue,
        // To use the proxy IP session rotation logic, you must turn the proxy usage on.
        proxyConfiguration,
        minConcurrency: 10,
        maxConcurrency: 50,
        // On error, retry each page at most twice.
        maxRequestRetries: 2,
        // Increase the timeout for processing of each page.
        handlePageTimeoutSecs: 50,
        // Limit to 1,000 requests per crawl.
        maxRequestsPerCrawl: 1000,
        handlePageFunction: async ({ request, $ /*, session */ }) => {
            const text = $('body').text();
            log.info(text);
            // ...
        },
    });
    await crawler.run();
});
Error: RequestError: getaddrinfo ENOTFOUND 5959 5959:80
The crawler seems to be mixing up the URL's port 5959 with port 80...
ERROR CheerioCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://ip.nf/me.txt","retryCount":3,"id":"F32s4Txz0fBUmwd"}
RequestError: getaddrinfo ENOTFOUND 5959 5959:80
    at ClientRequest.request.once (C:\Users\User\Documents\RnD\Node.js\mercateo-scraper\node_modules\got\dist\source\core\index.js:953:111)
    at Object.onceWrapper (events.js:285:13)
    at ClientRequest.emit (events.js:202:15)
    at ClientRequest.origin.emit.args (C:\Users\User\Documents\RnD\Node.js\mercateo-scraper\node_modules\@szmarczak\http-timer\dist\source\index.js:39:20)
    at onerror (C:\Users\User\Documents\RnD\Node.js\mercateo-scraper\node_modules\agent-base\dist\src\index.js:115:21)
    at callbackError (C:\Users\User\Documents\RnD\Node.js\mercateo-scraper\node_modules\agent-base\dist\src\index.js:134:17)
    at processTicksAndRejections (internal/process/next_tick.js:81:5)
Is there any way around this?
Try using it in this format:
http://username:password@host:port
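Applied to the snippet above, that means passing one full URL instead of the colon-separated host:port:user:pass string. A minimal sketch with the placeholder credentials from the question (note that, per Netnut's own notes, the country belongs after '-cc-', so Germany should be 'igorsavinkin-cc-DE' rather than 'igorsavinkin-DE'):

const Apify = require('apify');

Apify.main(async () => {
    // One full proxy URL in the http://username:password@host:port form.
    const proxyConfiguration = await Apify.createProxyConfiguration({
        proxyUrls: [
            'http://igorsavinkin-cc-DE:xxxxx@gw.ntnt.io:5959'
        ]
    });
    // ...the rest of the crawler setup stays the same.
});

With the scheme, credentials, host and port in their standard URL positions, got's proxy agent can resolve gw.ntnt.io instead of trying to resolve '5959' as a hostname, which appears to be what the ENOTFOUND error above was complaining about.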