How can I implement multiprocessing in this web scraping code? Should I use multithreading instead?
What I want to achieve is to reduce the time the scraping process takes and to store all the data in a dictionary (the dictionary is Untiters; the key is the username and the value is the number of times that user created a post with a specific title). I used this website as a tutorial, but I don't know how to apply what it explains to my code. Here is the code; sorry if I included larger parts of it than necessary.
from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup

z = 0
Untitleds = ["Sin título", "Untitled", "Sans titre", "İsimsiz", "Ohne Titel", "بلا عنوان",
             "Без названия", "无标题", "タイトルなし"]
Untiters = {}
Untits = []
x = 138
for i in range(1, 20):
    y = x + 1
    x = y
    Id = y
    link = "https://folioscope.co/blank/" + str(Id)
    Url = (link)
    R = requests.get(Url)
    Soup = BeautifulSoup(R.text, "html5lib")
    Pretitle = (Soup.find("div", {"class": "container_padding"}))
    Title = Pretitle.div.text
    if Title in (Untitleds):
        Prename = Soup.find("div", {"class": "padding_bottom_normal"})
        Name = Prename.a.text
        Untitled = z + 1
        z = Untitled
        if Name not in Untiters:
            Untiters.update({Name: 1})
        else:
            c0 = Untiters[Name]
            c1 = c0 + 1
            Untiters[Name] = c1
        Untits.append(Title)
        print(Title, Name)
To fetch the data from the site with multiprocessing.Pool, you can use the following example:
from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup


def get_data(id_):
    url = "https://folioscope.co/blank/" + str(id_)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # either element may be missing, so fall back to an empty string
    title = soup.select_one("#animation_container .title") or ""
    if title:
        title = title.text

    username = soup.select_one(".username") or ""
    if username:
        username = username.text

    return id_, title, username


if __name__ == "__main__":
    with Pool() as pool:
        # imap_unordered yields each result as soon as a worker finishes it
        for id_, title, username in pool.imap_unordered(
            get_data, range(138, 158)
        ):
            if title and username:
                print("{:<4} {:<40} {}".format(id_, title, username))
                # here you can add the result to list, filter duplicates etc.
Prints:
153 First attempt CyberAly
149 Minecraft Loop MisterD
142 An Idea! Pyro
148 Untitled szymun
152 Thunder dpknyk1993
139 Untitled WoopDeDoo
146 Untitled szymun
144 Loop pjrd
138 Blink fairyfina
140 Test sknob
154 Dragon Ball kameha piedicmolkok
157 Boom animation33
156 Tree in wind CyberAly
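As for whether to use multithreading instead: this workload is network-bound, not CPU-bound, so a thread pool (for example multiprocessing.dummy.Pool, which exposes the same Pool API, or concurrent.futures.ThreadPoolExecutor) would speed things up just as well as separate processes while avoiding the pickling overhead.

To end up with the Untiters dictionary described in the question (key = username, value = number of "Untitled" posts by that user), the results can be aggregated in the main process. Below is a minimal sketch, not part of the original answer, built on the same selectors as the example above and using collections.Counter for the counting step:

from collections import Counter
from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup

# the list of "Untitled" spellings from the question
Untitleds = ["Sin título", "Untitled", "Sans titre", "İsimsiz", "Ohne Titel",
             "بلا عنوان", "Без названия", "无标题", "タイトルなし"]


def get_data(id_):
    url = "https://folioscope.co/blank/" + str(id_)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    title = soup.select_one("#animation_container .title")
    username = soup.select_one(".username")
    # return plain strings so the result can be sent back to the main process cheaply
    return (title.text if title else "", username.text if username else "")


if __name__ == "__main__":
    Untiters = Counter()  # username -> number of "Untitled" posts
    with Pool() as pool:
        for title, username in pool.imap_unordered(get_data, range(138, 158)):
            if username and title in Untitleds:
                Untiters[username] += 1
    print(dict(Untiters))

A Counter behaves like a normal dict, so Untiters[username] += 1 works even the first time a username is seen. Switching this sketch to threads only requires changing the import to "from multiprocessing.dummy import Pool"; the rest of the code stays the same.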