How to parse the response from Grequests faster?
I want to scrape multiple URLs and parse them as fast as possible, but the for loop is not fast enough for me. Is there a way to do this with async, multiprocessing, or multithreading?
import grequests
from bs4 import BeautifulSoup

links1 = []  # multiple links

while True:
    try:
        reqs = (grequests.get(link) for link in links1)
        resp = grequests.imap(reqs, size=25, stream=False)
        for r in resp:  # I WANT TO RUN THIS FOR LOOP QUICK AS POSSIBLE ITS POSSIBLE?
            soup = BeautifulSoup(r.text, 'lxml')
            parse = soup.find('div', class_='txt')
    except Exception:
        pass  # the original snippet ended without an except clause; added so the try block is valid
An example of how to use multiprocessing with requests/BeautifulSoup:
import requests
from tqdm import tqdm  # for pretty progress bar
from bs4 import BeautifulSoup
from multiprocessing import Pool

# some 1000 links to analyze
links1 = [
    "https://en.wikipedia.org/wiki/2021_Moroccan_general_election",
    "https://en.wikipedia.org/wiki/Tangerang_prison_fire",
    "https://en.wikipedia.org/wiki/COVID-19_pandemic",
    "https://en.wikipedia.org/wiki/Yolanda_Fern%C3%A1ndez_de_Cofi%C3%B1o",
] * 250

def parse(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.select_one("h1").get_text(strip=True)

if __name__ == "__main__":
    with Pool() as p:
        out = []
        for r in tqdm(p.imap(parse, links1), total=len(links1)):
            out.append(r)
        print(len(out))
With my internet connection/CPU (Ryzen 3700x), I was able to get results from all 1000 links within 30 seconds:
100%|██████████| 1000/1000 [00:30<00:00, 33.12it/s]
1000
All my CPU cores were utilized (screenshot from htop):