How to parse the response from Grequests faster?
I want to scrape multiple URLs and parse them as fast as possible, but the for loop is not fast enough for me. Is there a way to do this with async, multiprocessing, or multithreading?
import grequests
from bs4 import BeautifulSoup

links1 = []  # multiple links

while True:
    try:
        reqs = (grequests.get(link) for link in links1)
        resp = grequests.imap(reqs, size=25, stream=False)
        for r in resp:  # I WANT TO RUN THIS FOR LOOP QUICK AS POSSIBLE ITS POSSIBLE?
            soup = BeautifulSoup(r.text, 'lxml')
            parse = soup.find('div', class_='txt')
    except Exception:
        pass  # the original snippet ended without an except clause; added so the try block is valid
An example of how to use multiprocessing with requests/BeautifulSoup:
import requests
from tqdm import tqdm  # for pretty progress bar
from bs4 import BeautifulSoup
from multiprocessing import Pool

# some 1000 links to analyze
links1 = [
    "https://en.wikipedia.org/wiki/2021_Moroccan_general_election",
    "https://en.wikipedia.org/wiki/Tangerang_prison_fire",
    "https://en.wikipedia.org/wiki/COVID-19_pandemic",
    "https://en.wikipedia.org/wiki/Yolanda_Fern%C3%A1ndez_de_Cofi%C3%B1o",
] * 250

def parse(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.select_one("h1").get_text(strip=True)

if __name__ == "__main__":
    with Pool() as p:
        out = []
        for r in tqdm(p.imap(parse, links1), total=len(links1)):
            out.append(r)
        print(len(out))
With my internet connection/CPU (Ryzen 3700x), I was able to get results from all 1000 links within 30 seconds:
100%|██████████| 1000/1000 [00:30<00:00, 33.12it/s]
1000
All my CPU cores were utilized (screenshot from htop):