Does newspaper3k work on stored data? I have already downloaded the contents of the URL
newspaper3k, on GitHub here, is a very useful library. Currently it works with Python 3. I would like to know whether its functions can operate on downloaded/stored text. The point is that we have already downloaded the contents of the URL and do not want to repeat that step every time we use one of its functions (keywords, summary, date, ...). For example, we want to query the stored data for the date and the authors. The obvious execution flow is 1. download, 2. parse, then extract the various pieces of information: text, title, images, and so on. To me this looks like a chain that always starts with a download:
>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)
>>> article.download()
>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']
>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)
>>> article.text
"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."
>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'
After your comment about using "Ctrl+S" to save the news source, I deleted my first answer and added this one.
I downloaded the contents of this article -- https://www.latimes.com/business/story/2021-02-08/tesla-invests-in-bitcoin -- to my file system.
The example below shows how to query that article from my local file system.
from newspaper import Article

with open("Elon Musk's Bitcoin embrace is a bit of a head-scratcher - Los Angeles Times.htm", 'r') as f:
    html = f.read()

# note the empty URL string
article = Article('', language='en')
article.download(input_html=html)
article.parse()
article_meta_data = article.meta_data

article_published_date = ''.join({value for (key, value) in article_meta_data['article'].items()
                                  if key == 'published_time'})
print(article_published_date)
# output
2021-02-08T15:52:56.252
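If you want that timestamp as a real `datetime` object rather than a string (for sorting or comparisons), the standard library can parse it directly; a minimal sketch using the value printed above:

```python
from datetime import datetime

# Parse the ISO 8601 timestamp string extracted from the article's meta tags.
published = datetime.fromisoformat('2021-02-08T15:52:56.252')
print(published.year, published.month, published.day)  # 2021 2 8
```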
print(article.title)
# output
Elon Musk’s Bitcoin embrace is a bit of a head-scratcher
article_author = {value for (key, value) in article_meta_data['article'].items() if key == 'author'}
print(''.join(article_author).rsplit('/', 1)[-1])
# output
russ-mitchell
article_summary = ''.join({value for (key, value) in article_meta_data['og'].items() if key == 'description'})
print(article_summary)
# output
The Tesla CEO says climate change is a threat to humanity, but his endorsement is driving demand for a cryptocurrency with a massive carbon footprint.
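As an aside, if all you need from the saved HTML are a few meta tags (published time, description, author), you can skip newspaper entirely and collect them with the standard library's `html.parser`. A stdlib-only sketch; the inline HTML here is a hypothetical stand-in for the saved page:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect <meta property/name=... content=...> pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            d = dict(attrs)
            key = d.get('property') or d.get('name')
            if key and 'content' in d:
                self.meta[key] = d['content']

# Stand-in for the contents of the saved .htm file
html = """<html><head>
<meta property="article:published_time" content="2021-02-08T15:52:56.252">
<meta property="og:description" content="Sample description.">
</head><body></body></html>"""

collector = MetaCollector()
collector.feed(html)
print(collector.meta['article:published_time'])  # 2021-02-08T15:52:56.252
```

This avoids a third-party dependency, but newspaper's `meta_data` does the same grouping (splitting `article:published_time` into `meta_data['article']['published_time']`) plus the article-text extraction, so it is only worth doing when the meta tags are all you need.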