Error while multi threading API queries in Python

Question

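I am running queries against a server on which I deployed the Open Source Routing Machine (OSRM). I send a set of coordinates and get back an n x n matrix of network distances over the street network.

To speed up the computation, I want to use "ThreadPoolExecutor" to parallelize the queries.

So far I have set up the connection in two ways, and both give me the same error: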


import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def osrm_query(url_input):
    'Send request'
    response = requests.get(url_input)
    r = response.json()

    return r


def osrm_query_2(url_input):
    'Send request'
    s = requests.Session()
    # Retry up to 3 times, with a short backoff, on these server errors
    retries = Retry(total=3,
                    backoff_factor=0.1,
                    status_forcelist=[500, 502, 503, 504])

    s.mount('https://', HTTPAdapter(max_retries=retries))
    response = s.get(url_input)
    r = response.json()

    return r

I generate a set of URLs (in the list _urls) that I want to send as requests, and I parallelize them this way:

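from concurrent.futures import ThreadPoolExecutor

r = []  # responses, collected in the same order as _urls
with ThreadPoolExecutor(max_workers=5) as executor:
    for each in executor.map(osrm_query_2, _urls):
        r.append(each)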

So far everything works, but when processing more than 40,000 URLs I get this error:

OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted

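As far as I understand, the problem is that I am sending too many requests from my machine and exhausting the ports available for outgoing requests (it looks like this has nothing to do with the machine I am sending the requests to).

How can I fix this? Is there a way to tell ThreadPoolExecutor to reuse connections?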

Answer 1

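I was pointed in the right direction by someone outside Stack Overflow.

The trick is to point the workers in the pool at a single requests Session, so that connections are reused instead of opened anew for every URL. The function that sends the query was re-worked as follows: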

def osrm_query(url_input, session):
    'Send request'
    response = session.get(url_input)
    r = response.json()

    return r

and the parallelization as:

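from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

r = []  # responses, collected in the same order as _urls
with ThreadPoolExecutor(max_workers=50) as executor:
    with requests.Session() as s:
        # repeat(s) hands the same session to every call of osrm_query
        for each in executor.map(osrm_query, _urls, repeat(s)):
            r.append(each)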

This way I reduced the execution time from 100 minutes (not parallelized) to 7 minutes, with max_workers set to 50, for 200,000 URLs.
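
One detail that may matter with 50 workers: a requests HTTPAdapter keeps at most 10 connections per host by default (pool_maxsize=10), so a shared Session used by more threads than that can still open and discard extra sockets. The sketch below is one possible way to size the pool to the worker count and keep the retry settings from the question. The helper names build_session and run_queries are made up for illustration, and mounting on "http://" assumes the OSRM server is reached over plain HTTP.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def build_session(max_workers):
    """Return a requests.Session whose pool can hold max_workers connections."""
    # Imported here so the helpers below can also be driven by a stand-in
    # session object on a machine where requests is not installed.
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retries = Retry(total=3,
                    backoff_factor=0.1,
                    status_forcelist=[500, 502, 503, 504])
    # pool_maxsize defaults to 10; raise it so every worker can keep a socket.
    adapter = HTTPAdapter(pool_maxsize=max_workers, max_retries=retries)
    session = requests.Session()
    # Assumption: the server may be reached over http or https, so mount both.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def osrm_query(url_input, session):
    """Send one request through the shared session and decode the JSON body."""
    response = session.get(url_input)
    return response.json()


def run_queries(urls, session, max_workers=50):
    """Fetch every URL with a thread pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(osrm_query, urls, repeat(session)))


# Possible usage, with _urls being the list of URLs from the question:
#
#     with build_session(50) as s:
#         r = run_queries(_urls, s, max_workers=50)
```

Whether this changes anything in practice depends on the server and the network, so it is worth measuring before and after.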

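As far as I know, the requests documentation does not explicitly promise that one Session object is safe to share between threads. If that is a concern, a variation is to give each worker thread its own Session, which still reuses connections within a thread. This is only a sketch: get_session, run_queries and the session_factory parameter are hypothetical names, and session_factory is assumed to be a zero-argument callable such as requests.Session.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Each thread sees its own attributes on this object.
_thread_local = threading.local()


def get_session(session_factory):
    """Return the calling thread's session, creating it on first use."""
    if not hasattr(_thread_local, "session"):
        _thread_local.session = session_factory()
    return _thread_local.session


def osrm_query(url_input, session_factory):
    """Send one request through this thread's own session."""
    session = get_session(session_factory)
    response = session.get(url_input)
    return response.json()


def run_queries(urls, session_factory, max_workers=50):
    """Fetch every URL; at most max_workers sessions are created in total."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda u: osrm_query(u, session_factory), urls))


# Possible usage, with _urls being the list of URLs from the question:
#
#     import requests
#     r = run_queries(_urls, requests.Session, max_workers=50)
```

The sessions here are never closed explicitly; they are simply dropped when the worker threads end, which is usually acceptable for a one-off batch job.
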
Tags: python, parallel-processing, request, osrm