Failure to use Netnut.io proxy with Apify Cheerio scraper

I'm developing a web scraper and I'd like to integrate Netnut's proxy into it.

The Netnut integration details I was given:

Proxy URL: gw.ntnt.io
Proxy Port: 5959
Proxy User: igorsavinkin-cc-any
Proxy Password: xxxxx

Example Rotating IP format (IP:PORT:USERNAME-CC-COUNTRY:PASSWORD): gw.ntnt.io:5959:igorsavinkin-cc-any:xxxxx

In order to change the country, please change 'any' to your desired country. (US, UK, IT, DE etc.) Available countries: https://l.netnut.io/countries

Our IPs are automatically rotated, if you wish to make them Static Residential, please add a session ID in the username parameter like the example below:

Username-cc-any-sid-any_number
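Per the rules quoted above, the username string can be assembled from its parts. `netnutUsername` below is a hypothetical helper for illustration, not part of any Netnut or Apify SDK:

```javascript
// Hypothetical helper that assembles a Netnut username following the
// username-cc-country[-sid-number] pattern quoted above.
function netnutUsername(base, country, sessionId) {
    let user = `${base}-cc-${country}`;
    if (sessionId !== undefined) {
        user += `-sid-${sessionId}`; // pins a Static Residential session
    }
    return user;
}

console.log(netnutUsername('igorsavinkin', 'any'));        // igorsavinkin-cc-any
console.log(netnutUsername('igorsavinkin', 'any', 12345)); // igorsavinkin-cc-any-sid-12345
```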

The code:

    Apify.main(async () => {
        const proxyConfiguration = await Apify.createProxyConfiguration({
            proxyUrls: [
                'gw.ntnt.io:5959:igorsavinkin-DE:xxxxx'
            ]
        });
        // Add URLs to a RequestQueue
        const requestQueue = await Apify.openRequestQueue(queue_name);
        await requestQueue.addRequest({ url: 'https://ip.nf/me.txt' });

        // Create an instance of the CheerioCrawler class - a crawler
        // that automatically loads the URLs and parses their HTML using the cheerio library.
        const crawler = new Apify.CheerioCrawler({
            // Let the crawler fetch URLs from our queue.
            requestQueue,
            // To use the proxy IP session rotation logic, you must turn the proxy usage on.
            proxyConfiguration,
            minConcurrency: 10,
            maxConcurrency: 50,
            // On error, retry each page at most twice.
            maxRequestRetries: 2,
            // Increase the timeout for processing of each page.
            handlePageTimeoutSecs: 50,
            // Limit the crawl to 1000 requests.
            maxRequestsPerCrawl: 1000,
            handlePageFunction: async ({ request, $/*, session*/ }) => {
                const text = $('body').text();
                log.info(text);
                ...
            }
        });
        await crawler.run();
    });

The error: RequestError: getaddrinfo ENOTFOUND 5959 5959:80

The crawler seems to be mixing up the proxy port 5959 with port 80...

    ERROR CheerioCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://ip.nf/me.txt","retryCount":3,"id":"F32s4Txz0fBUmwd"}
      RequestError: getaddrinfo ENOTFOUND 5959 5959:80
          at ClientRequest.request.once (C:\Users\User\Documents\RnD\Node.js\mercateo-scraper\node_modules\got\dist\source\core\index.js:953:111)
          at Object.onceWrapper (events.js:285:13)
          at ClientRequest.emit (events.js:202:15)
          at ClientRequest.origin.emit.args (C:\Users\User\Documents\RnD\Node.js\mercateo-scraper\node_modules\@szmarczak\http-timer\dist\source\index.js:39:20)
          at onerror (C:\Users\User\Documents\RnD\Node.js\mercateo-scraper\node_modules\agent-base\dist\src\index.js:115:21)
          at callbackError (C:\Users\User\Documents\RnD\Node.js\mercateo-scraper\node_modules\agent-base\dist\src\index.js:134:17)
          at processTicksAndRejections (internal/process/next_tick.js:81:5)
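The misparse can be reproduced with Node's WHATWG URL parser: because the string has no `http://` scheme, everything before the first colon (`gw.ntnt.io`) is read as the scheme, leaving the HTTP client to pull a hostname (`5959`) out of the remainder and fall back to port 80:

```javascript
// Demonstrate how a scheme-less proxy string gets misparsed as a URL.
const malformed = 'gw.ntnt.io:5959:igorsavinkin-DE:xxxxx';
const u = new URL(malformed);

console.log(u.protocol); // 'gw.ntnt.io:' -- the host was swallowed as a scheme
console.log(u.pathname); // '5959:igorsavinkin-DE:xxxxx'
console.log(u.host);     // ''          -- no host left at all
```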

Is there a way to fix this?

Try using it in this format:

http://username:password@host:port
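Applying that format to the credentials from the question (the password is a placeholder), the entry in `proxyUrls` would look like this; the string-building below is just a sketch:

```javascript
// Build the proxy URL in the http://username:password@host:port scheme.
// Values come from the question; the password is a placeholder.
const user = 'igorsavinkin-DE';   // or e.g. 'igorsavinkin-cc-any' for rotating IPs
const password = 'xxxxx';
const host = 'gw.ntnt.io';
const port = 5959;

const proxyUrl = `http://${user}:${password}@${host}:${port}`;
console.log(proxyUrl); // http://igorsavinkin-DE:xxxxx@gw.ntnt.io:5959

// Then pass it to Apify as before (sketch, not executed here):
// const proxyConfiguration = await Apify.createProxyConfiguration({
//     proxyUrls: [proxyUrl],
// });
```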