Multiple requests causing program to crash (using BeautifulSoup)
I'm writing a program in Python that lets the user enter several websites, then requests each one and scrapes and prints its title. However, once the list goes past 8 websites, the program crashes every time. I'm not sure whether it's a memory issue; I've searched around and can't find anyone with the same problem. The code is below (I've included a list of 9 sites, so you can just copy and paste the code to see the problem).
import requests
from bs4 import BeautifulSoup

lst = ['https://covid19tracker.ca/provincevac.html?p=ON', 'https://www.ontario.ca/page/reopening-ontario#foot-1', 'https://blog.twitter.com/en_us/topics/company/2020/keeping-our-employees-and-partners-safe-during-coronavirus.html', 'https://www.aboutamazon.com/news/company-news/amazons-covid-19-blog-updates-on-how-were-responding-to-the-crisis#covid-latest', 'https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility', 'https://news.prudential.com/increasingly-workers-expect-pandemic-workplace-adaptations-to-stick.htm', 'https://www.mckinsey.com/featured-insights/future-of-work/whats-next-for-remote-work-an-analysis-of-2000-tasks-800-jobs-and-nine-countries', 'https://www.gsb.stanford.edu/faculty-research/publications/does-working-home-work-evidence-chinese-experiment', 'https://www.livecareer.com/resources/careers/planning/is-remote-work-here-to-stay']

for websites in range(len(lst)):
    url = lst[websites]
    cite = requests.get(url, timeout=10).content
    soup = BeautifulSoup(cite, 'html.parser')
    title = soup.find('title').get_text().strip()
    print(title)
print("Didn't crash")
The second website has no title, but don't worry about that.
To avoid the crash, add a user-agent header via the headers= parameter of requests.get(); otherwise the site decides you're a bot and blocks you. The blocked response typically has no <title> tag, so soup.find('title') returns None and calling .get_text() on it raises an AttributeError.
cite = requests.get(url, headers=headers, timeout=10).content
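If you want to confirm that the failure comes from blocking rather than memory, a quick check (a diagnostic sketch, not part of the original answer) is to print each response's status code before parsing anything; a 403 or similar on the later sites points at bot detection:

import requests

# Shortened list for illustration; substitute the full lst from the question.
urls = [
    "https://covid19tracker.ca/provincevac.html?p=ON",
    "https://www.mckinsey.com/featured-insights/future-of-work/whats-next-for-remote-work-an-analysis-of-2000-tasks-800-jobs-and-nine-countries",
]

for url in urls:
    resp = requests.get(url, timeout=10)
    print(resp.status_code, url)  # 200 is fine; 403/503 suggests the site blocked the request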
In your case:
import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36"
}

lst = [
    "https://covid19tracker.ca/provincevac.html?p=ON",
    "https://www.ontario.ca/page/reopening-ontario#foot-1",
    "https://blog.twitter.com/en_us/topics/company/2020/keeping-our-employees-and-partners-safe-during-coronavirus.html",
    "https://www.aboutamazon.com/news/company-news/amazons-covid-19-blog-updates-on-how-were-responding-to-the-crisis#covid-latest",
    "https://www.bcg.com/en-us/publications/2021/advantages-of-remote-work-flexibility",
    "https://news.prudential.com/increasingly-workers-expect-pandemic-workplace-adaptations-to-stick.htm",
    "https://www.mckinsey.com/featured-insights/future-of-work/whats-next-for-remote-work-an-analysis-of-2000-tasks-800-jobs-and-nine-countries",
    "https://www.gsb.stanford.edu/faculty-research/publications/does-working-home-work-evidence-chinese-experiment",
    "https://www.livecareer.com/resources/careers/planning/is-remote-work-here-to-stay",
]

for websites in range(len(lst)):
    url = lst[websites]
    cite = requests.get(url, headers=headers, timeout=10).content
    soup = BeautifulSoup(cite, "html.parser")
    title = soup.find("title").get_text().strip()
    print(title)
print("Didn't crash")