Newspaper3k export to csv on first row only
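With the help of 'Life is complex', I managed to scrape data from the CNN news website. The data (URLs) extracted from it are saved in a .csv file (test1). Note that this was done manually because it was easier!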
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
import csv

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

with open('test1.csv', 'r') as file:
    csv_file = file.readlines()
    for url in csv_file:
        try:
            article = Article(url.strip(), config=config)
            article.download()
            article.parse()
            print(article.title)
            article_text = article.text.replace('\n', ' ')
            print(article.text)
        except ArticleException:
            print('***FAILED TO DOWNLOAD***', article.url)

with open('test2.csv', 'a', newline='') as csvfile:
    headers = ['article title', 'article text']
    writer = csv.DictWriter(csvfile, lineterminator='\n', fieldnames=headers)
    writer.writeheader()
    writer.writerow({'article title': article.title,
                     'article text': article.text})
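Using the code above, I managed to scrape the actual news information (title and content) from the URLs and export it to a .csv file. The only problem with the export is that it only exports the last title and text (so I think it keeps overwriting the information on the first row).
How can I get all the titles and content into the csv file?
Thank you for leaving me a message.
The code below should help you solve your CSV writing issue. If it doesn't, just let me know and I will revise my answer.
P.S. I will update my Newspaper3k overview document with more details on writing CSV files.
P.P.S. I'm currently writing a new news scraper, because development of Newspaper3k has ended. I'm not sure of the release date for my code.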
import csv
from newspaper import Config
from newspaper import Article
from os.path import exists

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

urls = ['https://www.cnn.com/2021/10/25/tech/facebook-papers/index.html', 'https://www.cnn.com/entertainment/live-news/rust-shooting-alec-baldwin-10-25-21/h_257c62772a2b69cb37db397592971b58']

for url in urls:
    article = Article(url, config=config)
    article.download()
    article.parse()

    # Pull the publication date out of the page's meta tags
    article_meta_data = article.meta_data
    published_date = {value for (key, value) in article_meta_data.items() if key == 'pubdate'}
    article_published_date = " ".join(str(x) for x in published_date)

    # The write happens inside the loop, so every article gets its own row
    file_exists = exists('cnn_extraction_results.csv')
    if not file_exists:
        # First article: create the file and write the header row once
        with open('cnn_extraction_results.csv', 'w', newline='') as file:
            headers = ['date published', 'article title', 'article text']
            writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
            writer.writeheader()
            writer.writerow({'date published': article_published_date,
                             'article title': article.title,
                             'article text': article.text})
    else:
        # Later articles: append a row without repeating the header
        with open('cnn_extraction_results.csv', 'a', newline='') as file:
            headers = ['date published', 'article title', 'article text']
            writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
            writer.writerow({'date published': article_published_date,
                             'article title': article.title,
                             'article text': article.text})
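
If the repeated writer setup in the two branches bothers you, the same idea can be folded into one helper. This is only a minimal sketch, not part of Newspaper3k: the names `append_rows` and `HEADERS` are my own, and it assumes you first collect one dict per article (with the same three keys used above) and then write them in a single call. It uses only the standard library, so it does not download anything itself.

```python
import csv
import os

# Hypothetical column names, matching the headers used in the answer above.
HEADERS = ['date published', 'article title', 'article text']


def append_rows(path, rows, headers=HEADERS):
    """Append dict rows to a CSV file, writing the header row only once."""
    # A header is needed only when the file is missing or still empty.
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0

    # Open once in append mode; newline='' lets the csv module control line endings.
    with open(path, 'a', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=headers)
        if needs_header:
            writer.writeheader()
        for row in rows:
            writer.writerow(row)
```

Inside the `for url in urls:` loop you would append a dict such as `{'date published': article_published_date, 'article title': article.title, 'article text': article.text}` to a list, then call `append_rows('cnn_extraction_results.csv', that_list)` after the loop. Running the script again adds new rows under the existing header instead of repeating it.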