Python Requests-futures slow
I'm currently using requests-futures to speed up my web scraping. The problem is that it's still slow: roughly one request per second. Here is what the ThreadPoolExecutor looks like:
with FuturesSession(executor=ThreadPoolExecutor(max_workers=8)) as session:
    futures = {session.get(url, proxies={
        'http': str(random.choice(proxy_list).replace("https:/", "http:/")),
        'https': str(random.choice(proxy_list).replace("https:/", "http:/")),
    }, headers={
        'User-Agent': str(ua.chrome),
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Content-Type': 'text/plain;charset=UTF-8',
    }): url for url in url_list}
    # ---
    for future in as_completed(futures):
        del futures[future]
        try:
            resp = future.result()
        except:
            print("Error getting result from thread. Ignoring")
        try:
            multiprocessing.Process(target=main_func, args=(resp,))
            del resp
            del future
        except requests.exceptions.JSONDecodeError:
            logging.warning(
                "[requests.custom.debug]: requests.exceptions.JSONDecodeError: [Error] print(resp.json())")
I think it's slow because of the as_completed for loop, since it isn't a concurrent loop. As for main_func, the function I pass each response to: that's the function that extracts the site information with bs4. Even if the as_completed for loop were concurrent, it would still be faster than this. I really want the scraper to be faster, and I'd like to keep using requests-futures, but if there's something much faster I'm happy to switch. So if anyone knows of something much faster than requests-futures, please feel free to share.
Can anyone help? Thanks.
Here is a reorganization of the code that should help:
import requests
from concurrent.futures import ProcessPoolExecutor
import random

proxy_list = [
    'http://107.151.182.247:80',
    'http://194.5.193.183:80',
    'http://88.198.50.103:8080',
    'http://88.198.24.108:8080',
    'http://64.44.164.254:80',
    'http://47.74.152.29:8888',
    'http://176.9.75.42:8080']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Content-Type': 'text/plain;charset=UTF-8'}

url_list = ['http://www.google.com', 'http://facebook.com', 'http://twitter.com']

def process(url):
    # pick a random proxy and build both scheme variants for the proxies dict
    proxy = random.choice(proxy_list)
    https = proxy.replace('http:', 'https:')
    http = proxy.replace('https:', 'http:')
    proxies = {'http': http, 'https': https}
    try:
        (r := requests.get(url, proxies=proxies, headers=headers)).raise_for_status()
        # call main_func here
    except Exception as e:
        return e
    return 'OK'

def main():
    with ProcessPoolExecutor() as executor:
        # each URL is fetched and processed in its own worker process
        for result in executor.map(process, url_list):
            print(result)

if __name__ == '__main__':
    main()
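The # call main_func here comment is where the bs4 parsing from the question would go. The real main_func isn't shown in the question, so here is only a minimal sketch, assuming it just needs the response body; extracting the page title is purely illustrative:

from bs4 import BeautifulSoup

def main_func(resp):
    # Hypothetical parser: the question only says main_func extracts site
    # information with bs4, so pulling out <title> is just an illustration.
    soup = BeautifulSoup(resp.text, 'html.parser')
    return soup.title.string if soup.title else None

With that in place, replacing the comment with main_func(r) runs the parsing inside the same worker process, so there is no reason to spawn a separate multiprocessing.Process per response.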
The proxy_list probably won't work for you; use your own list of proxies.
Obviously the url_list won't match yours either.
The point is that each URL is handled in its own process. There's really no need to mix threads and processes in this case, especially since doing so adds a degree of synchronicity: you wait for a thread to complete before running a sub-process.
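On the question's closing request for something much faster than requests-futures: one common alternative is asyncio with aiohttp, which keeps every request in flight concurrently from a single thread. This is only a sketch, reusing the url_list and headers from above; proxy rotation is left out for brevity (note that aiohttp takes a per-request proxy= string rather than requests' proxies dict):

import asyncio
import aiohttp

async def fetch(session, url):
    try:
        async with session.get(url, headers=headers) as resp:
            resp.raise_for_status()
            # hand the body off to main_func / bs4 from here
            return await resp.text()
    except Exception as e:
        return e

async def main():
    async with aiohttp.ClientSession() as session:
        # fire off all requests at once and wait for them together
        results = await asyncio.gather(*(fetch(session, url) for url in url_list))
        for url, result in zip(url_list, results):
            print(url, result if isinstance(result, Exception) else 'OK')

if __name__ == '__main__':
    asyncio.run(main())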