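How to get around Newspaper throwing 503 exceptions for certain webpages
I am trying to scrape some webpages with newspaper3k, but my program throws 503 exceptions. Can anyone help me identify the reason for this and help me solve it? To be exact, I am not looking to catch these exceptions, but to understand why they occur and prevent them if possible.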
from newspaper import Article
dates = list()
titles = list()
urls = ['https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-02',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-mps-hearing-may-21',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-05-06',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-fsr-hearing-may-21',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-03-04',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/fec-2019-20-reserve-bank-annual-review',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-12-02',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-28',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-22',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-10-19',
'https://www.rbnz.govt.nz/research-and-publications/speeches/2020/speech2020-09-14']
for url in urls:
    speech = Article(url)
    speech.download()
    speech.parse()
    dates.append(speech.publish_date)
    titles.append(speech.title)
Here is my traceback:
---------------------------------------------------------------------------
ArticleException Traceback (most recent call last)
<ipython-input-5-217a6cafe26a> in <module>
20 speech = Article(url)
21 speech.download()
---> 22 speech.parse()
23 dates.append(speech.publish_date)
24 titles.append(speech.title)
/opt/anaconda3/lib/python3.8/site-packages/newspaper/article.py in parse(self)
189
190 def parse(self):
--> 191 self.throw_if_not_downloaded_verbose()
192
193 self.doc = self.config.get_parser().fromstring(self.html)
/opt/anaconda3/lib/python3.8/site-packages/newspaper/article.py in throw_if_not_downloaded_verbose(self)
529 raise ArticleException('You must `download()` an article first!')
530 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
--> 531 raise ArticleException('Article `download()` failed with %s on URL %s' %
532 (self.download_exception_msg, self.url))
533
ArticleException: Article `download()` failed with 503 Server Error: Service Temporarily Unavailable
for url: https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29
on URL https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29
Here is how to troubleshoot the 503 Server Error: Service Temporarily Unavailable error using the Python package requests.
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
}
base_url = 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29'
req = requests.get(base_url, headers=headers)
print(req.status_code)
# output
503
Why are we getting a 503 Server Error?
Let's look at the content the server returns.
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
}
base_url = 'https://www.rbnz.govt.nz/research-and-publications/speeches/2021/speech2021-06-29'
req = requests.get(base_url, headers=headers)
print(req.text)
# output
truncated...
<title>Website unavailable - Reserve Bank of New Zealand - Te Pūtea Matua</title>
truncated...
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
truncated...
<form class="challenge-form" id="challenge-form" action="/research-and-publications/speeches/2021/speech2021-06-29?__cf_chl_jschl_tk__=73ad3f68fb15cc9284b25b7802626dd4ebe102cd-1625840173-0-ATQAZ5g7wCwLU2Q7agCqc1p59qs6ghpsYPVhDNwDN5r7vefk0P1UbjR4AJOUl0kUCZmDi-EVWX8XekL6VkqOgKTd1zqd5QWWlT3f2Dp_aUWQgCAH3bnS4x0wyc8-xGOLm-tcMKCXcTXH-OpiGoUX8u__bk1TIZ0gI_TYMB-oy0nJi7dMYLgJnvJhwhTllDoYUbCzmo2h2idIJPqIjNaAwupvbdpvHnrogPDnFhCe8Cco9-eKlq4w0G563f_OJ3M7YQChBjCoHYlT8baMoOLzP-Kb33rNmlG0uXhzoiIBROsPw9pavOrO1vsbqf31ZArDRuy0y7rsfrhAD7iU113zmypN81tgqgL_F8YTzygRvI_z3Cs2YOMxjB53-jq1pWwqsW_ItTaY7I3vh5lg_12EUzEddcwmuIj1wI2NbnA7EU06QNHYYn_Ye4TKM0gu9k4031hGybszE3nRKCdTXgMSgJbYhTJ6bJYPSb_2IHMUHlYyHksxePJ4C_5-5X8qIdJApSTFBfCLLLAZLrkFnBk7ep4" method="POST" enctype="application/x-www-form-urlencoded">
truncated...
var a = document.getElementById('cf-content');
truncated...
<p>Your access to the Reserve Bank website has been restricted. If you think you should be able to access our website please email <a href="mailto:web@rbnz.govt.nz">web@rbnz.govt.nz</a>.
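If we look at the returned text, we can see that the website is asking your browser to complete a challenge-form. If you look at the other data points in the text (for example cf-content), you can see that the website is protected by Cloudflare.
Bypassing this protection is extremely difficult. Here is one of my recent answers on the complexities of bypassing this type of protection.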
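The check described above can be automated so that a challenge page is told apart from a genuine outage. The following is a minimal sketch, not part of newspaper3k or requests: the function name looks_like_cloudflare_challenge and the CHALLENGE_MARKERS tuple are hypothetical, and the markers are simply the strings visible in the output quoted above, so they may not match other sites or later versions of the challenge page.

```python
# Strings that appear in the challenge page quoted above. This list is an
# assumption based on that one response, not an official Cloudflare signature.
CHALLENGE_MARKERS = ("challenge-form", "cf-content", "__cf_chl_jschl_tk__")


def looks_like_cloudflare_challenge(status_code, body):
    """Return True if a 503 response body contains a known challenge marker."""
    if status_code != 503:
        return False
    return any(marker in body for marker in CHALLENGE_MARKERS)


if __name__ == "__main__":
    # Shortened stand-in for the HTML returned by the server.
    sample_body = '<form class="challenge-form" id="challenge-form" method="POST">'
    print(looks_like_cloudflare_challenge(503, sample_body))
```

With the requests snippet above, this could be called as looks_like_cloudflare_challenge(req.status_code, req.text). A True result suggests the 503 comes from the protection layer rather than from the site being down, so retrying the same request is unlikely to help.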