Trouble fetching results using yield while going for multiprocessing
I'm trying to create a script in Python that applies multiprocessing to fetch the links of different users from a webpage. Although the users' links are available on the landing page, I'm trying to dig them out from their inner pages. However, when I use yield within the get_links() function and print() within get_target_link(), I can get the expected results.

My question is: how can I achieve the same using yield within both functions?

I've tried with:
import requests
import concurrent.futures
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_links(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".summary .question-hyperlink"):
        yield urljoin(base,item.get("href"))

def get_target_link(targeturl):
    res = requests.get(targeturl)
    soup = BeautifulSoup(res.text,"lxml")
    name_link = urljoin(base,soup.select_one(".user-details > a").get("href"))
    yield name_link

if __name__ == '__main__':
    base = 'https://whosebug.com'
    mlink = "https://whosebug.com/questions/tagged/web-scraping"
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        future_to_url = {executor.submit(get_target_link, url): url for url in get_links(mlink)}
        concurrent.futures.as_completed(future_to_url)
The above script produces no results at all.
There are a few problems with your initial approach that lead to "no result at all":

- BeautifulSoup(res.text,"lxml") - change the parser to html.parser (you are parsing HTML web pages; html.parser ships with the standard library, while "lxml" requires the third-party lxml package and makes BeautifulSoup raise bs4.FeatureNotFound when it is missing)
- making the function get_target_link a generator gives no benefit, since it is not meant to be an iterator and it already produces its single result at once
- concurrent.futures.as_completed returns an iterator over Future instances, not the final results
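The second point is the core of the "no result at all" behaviour and can be sketched in isolation. In this minimal demo (the names gen_style and return_style are made up for illustration), submitting a generator function to an executor returns a Future whose result is an unconsumed generator object; the function body never runs unless someone iterates it:

```python
from concurrent.futures import ThreadPoolExecutor

def gen_style():
    # generator function: the body does not run until the generator is iterated
    print("running")  # never printed in this demo
    yield "value"

def return_style():
    return "value"

with ThreadPoolExecutor(max_workers=2) as executor:
    f1 = executor.submit(gen_style)
    f2 = executor.submit(return_style)
    print(type(f1.result()).__name__)  # generator - the body never ran
    print(f2.result())                 # value
```

This is why your get_target_link with yield silently did nothing: each Future completed immediately with a generator object that nobody ever consumed.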
The corrected approach would look like below:
import requests
import concurrent.futures as futures
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_links(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    for link in soup.select(".summary .question-hyperlink"):
        yield urljoin(base, link.get("href"))

def get_target_link(target_url):
    res = requests.get(target_url)
    soup = BeautifulSoup(res.text, "html.parser")
    name_link = urljoin(base, soup.select_one(".user-details a").get("href"))
    return name_link

if __name__ == '__main__':
    base = 'https://whosebug.com'
    mlink = "https://whosebug.com/questions/tagged/web-scraping"
    with futures.ThreadPoolExecutor(max_workers=10) as executor:
        future_to_url = {executor.submit(get_target_link, url): url for url in get_links(mlink)}
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as ex:
                print(f'Failed to extract user details from url: {url}')
            else:
                print(data)
Output:
https://whosebug.com/users/10035985/andrej-kesely
https://whosebug.com/users/11520568/rachit-gupta
https://whosebug.com/users/10568531/robots-txt
https://whosebug.com/users/10664939/logan-anderson
https://whosebug.com/users/688393/c%c3%a9sar
https://whosebug.com/users/903061/gregor
https://whosebug.com/users/9950503/saraherceg
https://whosebug.com/users/80851/gmile
https://whosebug.com/users/11793150/saurabh-rawat
https://whosebug.com/users/11793061/xzatar
https://whosebug.com/users/11759292/rachel9866
https://whosebug.com/users/2628114/user2628114
https://whosebug.com/users/9810397/bart
https://whosebug.com/users/838355/ir2pid
https://whosebug.com/users/10629482/shreya
https://whosebug.com/users/11669928/thor-is
https://whosebug.com/users/7660288/acro2142
https://whosebug.com/users/3342430/freddiev4
https://whosebug.com/users/11767045/k-%c3%96sterlund
https://whosebug.com/users/11781213/mohamed-shire
https://whosebug.com/users/5412619/a-nonymous
https://whosebug.com/users/4354477/forcebru
https://whosebug.com/users/10568531/robots-txt
https://whosebug.com/users/6622587/eyllanesc
https://whosebug.com/users/10568531/robots-txt
https://whosebug.com/users/3273177/casabonita
https://whosebug.com/users/1540328/dipesh-parmar
https://whosebug.com/users/6231957/perth
https://whosebug.com/users/11400264/workin-4weekend
https://whosebug.com/users/1000551/vadim-kotov
https://whosebug.com/users/331508/brock-adams
https://whosebug.com/users/11300154/helloworld1990
https://whosebug.com/users/11786268/mohsine-jirou
https://whosebug.com/users/9707561/fatima-tt
https://whosebug.com/users/11759292/rachel9866
https://whosebug.com/users/6622587/eyllanesc
https://whosebug.com/users/11485683/titan
https://whosebug.com/users/11593630/supek
https://whosebug.com/users/11717116/raja-kishore-patnayakuni
https://whosebug.com/users/975887/madushan
https://whosebug.com/users/10568531/robots-txt
https://whosebug.com/users/283366/phil
https://whosebug.com/users/8677101/bpdesilva
https://whosebug.com/users/3504096/programmerper
https://whosebug.com/users/6303216/akhlaq-ahmed
https://whosebug.com/users/11457578/sh-student
https://whosebug.com/users/11783947/alexis-cruz-cruz
https://whosebug.com/users/3579212/adnanmuttaleb
https://whosebug.com/users/1060350/anony-mousse
https://whosebug.com/users/8100732/khadija-saeed