Newspaper3k: filter out bad URLs while extracting
With some help ;) I managed to scrape titles and content from the CNN news site and write them to a .csv file.
Now, the list of URLs (already extracted with another script) contains some bad URLs. That script is very simple: it just scans the site and returns every URL it finds, so the list ends up with some bad ones (e.g. http://cnn.com/date/2021-10-17).
Instead of going through the list and deleting those bad URLs by hand, I'd like to know whether I can fix this by changing my code so that it skips a bad URL, continues with the next one, and so on.
Example code:
import csv
from newspaper import Config
from newspaper import Article
from os.path import exists
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
urls = ['https://www.cnn.com/2021/10/25/tech/facebook-papers/index.html', 'http://cnn.com/date/2021-10-17', 'https://www.cnn.com/entertainment/live-news/rust-shooting-alec-baldwin-10-25-21/h_257c62772a2b69cb37db397592971b58']
# the above normally would be where I refer to the .csv file with URLs
for url in urls:
    article = Article(url, config=config)
    article.download()
    article.parse()
    article_meta_data = article.meta_data

    file_exists = exists('cnn_extraction_results.csv')
    if not file_exists:
        with open('cnn_extraction_results.csv', 'w', newline='') as file:
            headers = ['article title', 'article text']
            writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
            writer.writeheader()
            writer.writerow({'article title': article.title,
                             'article text': article.text})
    else:
        with open('cnn_extraction_results.csv', 'a', newline='') as file:
            headers = ['article title', 'article text']
            writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
            writer.writerow({'article title': article.title,
                             'article text': article.text})
Try this: wrap the download and parse calls in a try/except that catches newspaper's ArticleException. A bad URL is then reported and skipped, and the loop moves on to the next one.
import csv
from os.path import exists
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
urls = ['https://www.cnn.com/2021/10/25/tech/facebook-papers/index.html',
        'http://cnn.com/date/2021-10-17',
        'https://www.cnn.com/entertainment/live-news/rust-shooting-alec-baldwin-10-25-21/h_257c62772a2b69cb37db397592971b58']

for url in urls:
    try:
        article = Article(url, config=config)
        article.download()
        article.parse()
        article_meta_data = article.meta_data

        file_exists = exists('cnn_extraction_results.csv')
        if not file_exists:
            with open('cnn_extraction_results.csv', 'w', newline='') as file:
                headers = ['article title', 'article text']
                writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
                writer.writeheader()
                writer.writerow({'article title': article.title,
                                 'article text': article.text})
        else:
            with open('cnn_extraction_results.csv', 'a', newline='') as file:
                headers = ['article title', 'article text']
                writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
                writer.writerow({'article title': article.title,
                                 'article text': article.text})
    except ArticleException:
        # download() or parse() failed (bad URL, 404, timeout, ...): report it and skip
        print('***FAILED TO DOWNLOAD***', url)
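As a side note (a minimal sketch building on the answer above, not part of it): since the header only needs to be written once, you can open the CSV file a single time before the loop instead of re-checking for it on every URL. The filename and column names below are the same ones used above; everything else behaves the same, with ArticleException still catching the bad URLs.

import csv
from os.path import exists
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

urls = ['https://www.cnn.com/2021/10/25/tech/facebook-papers/index.html',
        'http://cnn.com/date/2021-10-17',
        'https://www.cnn.com/entertainment/live-news/rust-shooting-alec-baldwin-10-25-21/h_257c62772a2b69cb37db397592971b58']

headers = ['article title', 'article text']
file_exists = exists('cnn_extraction_results.csv')

# open the file once in append mode; write the header only if the file is new
with open('cnn_extraction_results.csv', 'a', newline='') as file:
    writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
    if not file_exists:
        writer.writeheader()
    for url in urls:
        try:
            article = Article(url, config=config)
            article.download()
            article.parse()
            writer.writerow({'article title': article.title,
                             'article text': article.text})
        except ArticleException:
            # bad URL: report it and continue with the next one
            print('***FAILED TO DOWNLOAD***', url)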