处理报纸中的文章异常
Handling Article Exceptions in Newspaper
我有一些代码可以使用报纸查看各种媒体并从中下载文章。这已经运行了很长时间,但最近开始出现问题。我可以看出问题所在,但由于我是 Python 的新手,所以我不确定解决该问题的最佳方法。基本上(我认为)我需要进行修改以防止偶尔的格式错误的网址完全崩溃脚本,而是允许它放弃该网址并继续使用其他网址。
错误的来源是当我尝试使用以下方式下载文章时:
article.download()
有些文章(很明显他们每天都在变)会抛出下面的错误但是脚本继续到运行:
Traceback (most recent call last):
File "C:\Anaconda3\lib\encodings\idna.py", line 167, in encode
raise UnicodeError("label too long")
UnicodeError: label too long
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\newspaper\mthreading.py", line 38, in run
func(*args, **kargs)
File "C:\Anaconda3\lib\site-packages\newspaper\source.py", line 350, in download_articles
html = network.get_html(url, config=self.config)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 39, in get_html return get_html_2XX_only(url, config, response)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 60, in get_html_2XX_only url=url, **get_request_kwargs(timeout, useragent))
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 72, in get return request('get', url, params=params, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 58, in request return session.request(method=method, url=url, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 502, in request resp = self.send(prep, **send_kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 612, in send r = adapter.send(request, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\adapters.py", line 440, in send timeout=timeout
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen chunked=chunked)
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 356, in _make_request conn.request(method, url, **httplib_request_kw)
File "C:\Anaconda3\lib\http\client.py", line 1107, in request self._send_request(method, url, body, headers)
File "C:\Anaconda3\lib\http\client.py", line 1152, in _send_request self.endheaders(body)
File "C:\Anaconda3\lib\http\client.py", line 1103, in endheaders self._send_output(message_body)
File "C:\Anaconda3\lib\http\client.py", line 934, in _send_output self.send(msg)
File "C:\Anaconda3\lib\http\client.py", line 877, in send self.connect()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 166, in connect conn = self._new_conn()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 141, in _new_conn (self.host, self.port), self.timeout, **extra_kw)
File "C:\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "C:\Anaconda3\lib\socket.py", line 733, in getaddrinfo for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
接下来应该对每篇文章进行解析和 运行 自然语言处理,并将某些元素写入数据框,这样我就有了:
for paper in papers:
for article in paper.articles:
article.parse()
print(article.title)
article.nlp()
if article.publish_date is None:
d = datetime.now().date()
else:
d = article.publish_date.date()
stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title, article.summary, article.keywords, article.url]
i += 1
(这可能也有点草率,但这是另一天的问题)
这 运行 很好,直到它到达其中一个有错误的 URL,然后抛出文章异常并且脚本崩溃:
C:\Anaconda3\lib\site-packages\PIL\TiffImagePlugin.py:709: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.
warnings.warn(str(msg))
ArticleException Traceback (most recent call last) <ipython-input-17-2106485c4bbb> in <module>()
4 for paper in papers:
5 for article in paper.articles:
----> 6 article.parse()
7 print(article.title)
8 article.nlp()
C:\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
183
184 def parse(self):
--> 185 self.throw_if_not_downloaded_verbose()
186
187 self.doc = self.config.get_parser().fromstring(self.html)
C:\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
519 if self.download_state == ArticleDownloadState.NOT_STARTED:
520 print('You must `download()` an article first!')
--> 521 raise ArticleException()
522 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
523 print('Article `download()` failed with %s on URL %s' %
ArticleException:
那么防止此问题终止我的脚本的最佳方法是什么?我应该在出现 unicode 错误的下载阶段解决它,还是在解析阶段通过告诉它忽略那些错误地址来解决它?我将如何着手实施该更正?
非常感谢任何建议。
我遇到了同样的问题,虽然通常使用 除了:不推荐通过 ,但以下对我有用:
try:
a.parse()
file.write( a.title+'\n')
except :
pass
我发现 Navid 对于这个确切问题是正确的。
然而,.parse() 只是可能使您出错的函数之一。我将所有调用包装在 try / accept 结构中,如下所示:
word_list = []
for words in google_news.articles:
try:
words.download()
words.parse()
words.nlp()
except:
pass
word_list.append(words.keywords)
您可以尝试捕获 ArticleException。不要忘记 import
报纸模块。
try:
article.download()
article.parse()
except newspaper.article.ArticleException:
# do something
我有一些代码可以使用报纸查看各种媒体并从中下载文章。这已经运行了很长时间,但最近开始出现问题。我可以看出问题所在,但由于我是 Python 的新手,所以我不确定解决该问题的最佳方法。基本上(我认为)我需要进行修改以防止偶尔的格式错误的网址完全崩溃脚本,而是允许它放弃该网址并继续使用其他网址。
错误的来源是当我尝试使用以下方式下载文章时:
article.download()
有些文章(很明显他们每天都在变)会抛出下面的错误但是脚本继续到运行:
Traceback (most recent call last):
File "C:\Anaconda3\lib\encodings\idna.py", line 167, in encode
raise UnicodeError("label too long")
UnicodeError: label too long
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\newspaper\mthreading.py", line 38, in run
func(*args, **kargs)
File "C:\Anaconda3\lib\site-packages\newspaper\source.py", line 350, in download_articles
html = network.get_html(url, config=self.config)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 39, in get_html return get_html_2XX_only(url, config, response)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 60, in get_html_2XX_only url=url, **get_request_kwargs(timeout, useragent))
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 72, in get return request('get', url, params=params, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 58, in request return session.request(method=method, url=url, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 502, in request resp = self.send(prep, **send_kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 612, in send r = adapter.send(request, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\adapters.py", line 440, in send timeout=timeout
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen chunked=chunked)
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 356, in _make_request conn.request(method, url, **httplib_request_kw)
File "C:\Anaconda3\lib\http\client.py", line 1107, in request self._send_request(method, url, body, headers)
File "C:\Anaconda3\lib\http\client.py", line 1152, in _send_request self.endheaders(body)
File "C:\Anaconda3\lib\http\client.py", line 1103, in endheaders self._send_output(message_body)
File "C:\Anaconda3\lib\http\client.py", line 934, in _send_output self.send(msg)
File "C:\Anaconda3\lib\http\client.py", line 877, in send self.connect()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 166, in connect conn = self._new_conn()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 141, in _new_conn (self.host, self.port), self.timeout, **extra_kw)
File "C:\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "C:\Anaconda3\lib\socket.py", line 733, in getaddrinfo for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
接下来应该对每篇文章进行解析和 运行 自然语言处理,并将某些元素写入数据框,这样我就有了:
for paper in papers:
for article in paper.articles:
article.parse()
print(article.title)
article.nlp()
if article.publish_date is None:
d = datetime.now().date()
else:
d = article.publish_date.date()
stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title, article.summary, article.keywords, article.url]
i += 1
(这可能也有点草率,但这是另一天的问题)
这 运行 很好,直到它到达其中一个有错误的 URL,然后抛出文章异常并且脚本崩溃:
C:\Anaconda3\lib\site-packages\PIL\TiffImagePlugin.py:709: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.
warnings.warn(str(msg))
ArticleException Traceback (most recent call last) <ipython-input-17-2106485c4bbb> in <module>()
4 for paper in papers:
5 for article in paper.articles:
----> 6 article.parse()
7 print(article.title)
8 article.nlp()
C:\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
183
184 def parse(self):
--> 185 self.throw_if_not_downloaded_verbose()
186
187 self.doc = self.config.get_parser().fromstring(self.html)
C:\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
519 if self.download_state == ArticleDownloadState.NOT_STARTED:
520 print('You must `download()` an article first!')
--> 521 raise ArticleException()
522 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
523 print('Article `download()` failed with %s on URL %s' %
ArticleException:
那么防止此问题终止我的脚本的最佳方法是什么?我应该在出现 unicode 错误的下载阶段解决它,还是在解析阶段通过告诉它忽略那些错误地址来解决它?我将如何着手实施该更正?
非常感谢任何建议。
我遇到了同样的问题,虽然通常使用 除了:不推荐通过 ,但以下对我有用:
try:
a.parse()
file.write( a.title+'\n')
except :
pass
我发现 Navid 对于这个确切问题是正确的。
然而,.parse() 只是可能使您出错的函数之一。我将所有调用包装在 try / accept 结构中,如下所示:
word_list = []
for words in google_news.articles:
try:
words.download()
words.parse()
words.nlp()
except:
pass
word_list.append(words.keywords)
您可以尝试捕获 ArticleException。不要忘记 import
报纸模块。
try:
article.download()
article.parse()
except newspaper.article.ArticleException:
# do something