Python HTTPConnectionPool Failed to establish a new connection: [Errno 11004] getaddrinfo failed

Question

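I want to know whether my requests are being blocked by the website and whether I need to set up a proxy. I first tried closing the HTTP connection, but that failed. I also tried testing my code, but now it seems to produce no output. Maybe everything would be fine if I used a proxy? Here is the code.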

import requests
from urllib.parse import urlencode
import json
from bs4 import BeautifulSoup
import re
from html.parser import HTMLParser
from multiprocessing import Pool
from requests.exceptions import RequestException
import time


def get_page_index(offset, keyword):
    #headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(e)

def parse_page_index(html):
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            url = item.get('article_url')
            if url and len(url) < 100:
                yield url

def get_page_detail(url):
    try:
        response = requests.get(url, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(e)

def parse_page_detail(html):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    pattern = re.compile(r'articleInfo: (.*?)},', re.S)
    pattern_abstract = re.compile(r'abstract: (.*?)\.', re.S)
    res = re.search(pattern, html)
    res_abstract = re.search(pattern_abstract, html)
    if res and res_abstract:
        data = res.group(1).replace(r".replace(/<br \/>|\n|\r/ig, '')", "") + '}'
        abstract = res_abstract.group(1).replace(r"'", "")
        content = re.search(r'content: (.*?),', data).group(1)
        source = re.search(r'source: (.*?),', data).group(1)
        time_pattern = re.compile(r'time: (.*?)}', re.S)
        date = re.search(time_pattern, data).group(1)
        date_today = time.strftime('%Y-%m-%d')
        img = re.findall(r'src=&quot;(.*?)&quot', content)
        if date[1:11] == date_today and len(content) > 50 and img:
            return {
                'title': title,
                'content': content,
                'source': source,
                'date': date,
                'abstract': abstract,
                'img': img[0]
            }

def main(offset):
    flag = 1
    html = get_page_index(offset, '光伏')
    for url in parse_page_index(html):
        html = get_page_detail(url)
        if html:
            data = parse_page_detail(html)
            if data:
                html_parser = HTMLParser()
                cwl = html_parser.unescape(data.get('content'))
                data['content'] = cwl
                print(data)
                print(data.get('img'))
                flag += 1
                if flag == 5:
                    break



if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i*20 for i in range(10)])

Here is the error:

HTTPConnectionPool(host='tech.jinghua.cn', port=80): Max retries exceeded with url: /zixun/20160720/f191549.shtml (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x00000000048523C8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))

By the way, when I first tested my code, everything worked fine. Thanks in advance!

Answer 1

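It seems to me that you are hitting the connection limit in HTTPConnectionPool, since you start 10 jobs at the same time through the multiprocessing Pool.

Try one of the following:

Increase the request timeout (in seconds): requests.get('url', timeout=5)
Close the response with Response.close(): instead of returning response.text directly, assign it to a variable, close the response, and then return the variable.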
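
The two suggestions above can be combined in one helper. This is only a minimal sketch: the name `fetch_text` and its `get` parameter are hypothetical (they are not part of requests), and `get` is assumed to be a callable shaped like `requests.get` so the helper can be tried without network access.

```python
from contextlib import closing


def fetch_text(url, get, timeout=5):
    """Return the body of `url` as text, or None on a non-200 status.

    `get` is assumed to behave like requests.get: it accepts a URL and a
    `timeout` keyword and returns an object with `status_code`, `text`
    and `close()`.
    """
    # The timeout keeps a stalled request from hanging indefinitely.
    # closing() calls response.close() when the block exits, even if an
    # exception is raised inside it.
    with closing(get(url, timeout=timeout)) as response:
        if response.status_code != 200:
            return None
        # Copy the body into a variable before the response is closed.
        text = response.text
    return text
```

With requests installed, this would be called as fetch_text(url, requests.get) in place of the bare requests.get calls in get_page_index and get_page_detail.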

Answer 2

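When I ran into this problem, I saw the following symptoms:

- The Python requests module could not fetch anything from any URL, although I could browse the site in a browser and download the page with wget or curl.
- pip install did not work either and failed with the following error: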

Failed to establish a new connection: [Errno 11004] getaddrinfo failed

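Some sites had blocked me, so I tried ForceBindIP to make my Python modules use another network interface, and later removed it. That probably messed up my network configuration: the requests module, and even the plain socket module, got stuck and could not fetch any URL.

So I reset my network configuration by following the guide at the URL below, and now everything works.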

network configuration reset
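
Since "getaddrinfo failed" is raised during host name resolution, it can help to check resolution directly before resetting anything. This would show whether the failure sits in DNS or the local network setup rather than in requests itself. The following is a minimal sketch using only the standard library; the helper name `can_resolve` and its `resolver` parameter are illustrative assumptions, not part of any library.

```python
import socket
from urllib.parse import urlparse


def can_resolve(url_or_host, port=80, resolver=socket.getaddrinfo):
    """Return True if the host name resolves, False if resolution fails."""
    # Accept either a full URL or a bare host name.
    host = urlparse(url_or_host).hostname or url_or_host
    try:
        resolver(host, port)
    except socket.gaierror:
        # The same failure class that requests reports as
        # "[Errno 11004] getaddrinfo failed" on Windows.
        return False
    return True
```

If can_resolve('tech.jinghua.cn') returns False while a browser on the same machine can open the site, the Python process is probably using a different or broken network path, which matches the situation described in this answer.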

Answer 3

To help others: I ran into the same error message:

Client-Request-ID=long-string Retry policy did not allow for a retry: , HTTP status code=Unknown, Exception=HTTPSConnectionPool(host='table.table.core.windows.net', port=443): Max retries exceeded with url: /service(PartitionKey='requests',RowKey='9999') (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001D920ADA970>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')).

...when trying to retrieve a record from Azure Table Storage using

table_service.get_entity(table_name, partition_key, row_key).

My problem:

I had defined table_name incorrectly.

Tags: python, multithreading, pool, python-requests