Python Newspaper with web archive (Wayback Machine)
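I am trying to use the Python library newspaper with archives from the Wayback Machine, which stores archived old versions of websites. In theory, old news articles can be queried and downloaded from these archives.
For example, the following code queries the CNBC archive for a specific archive date.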
import newspaper
url = 'http://web.archive.org/web/20161201123529/http://www.cnbc.com/'
paper = newspaper.build(url, memoize_articles=False)
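Although the archived site itself contains links to the actual news articles from 2016-12-01, the newspaper module does not seem to extract them. Instead, you get URLs such as: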
https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/
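This is not an actual article from the archived version of CNBC. However, newspaper works well with today's version of CNBC.
I suspect it gets confused by the format of the URL (which contains two occurrences of http). Does anyone have suggestions on how to extract articles from Wayback Machine archives?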
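To make the "two http" structure concrete: a Wayback Machine capture URL embeds a timestamp followed by the original URL. The following is a minimal sketch using only the standard library, not anything provided by newspaper. The helper name split_wayback_url is hypothetical, and the pattern assumes a full 14-digit timestamp, optionally followed by a short modifier such as id_.

```python
import re

# Hypothetical helper, not part of newspaper. Assumes the capture URL has the
# form http(s)://web.archive.org/web/<14-digit timestamp>[modifier]/<original URL>.
WAYBACK_PATTERN = re.compile(
    r'^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$'
)


def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a capture URL."""
    match = WAYBACK_PATTERN.match(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_wayback_url('http://web.archive.org/web/20161201123529/http://www.cnbc.com/'))
print(split_wayback_url('https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/'))
```

The first call yields the timestamp and the archived CNBC address as separate values, while the blog.archive.org link is rejected because it is not a capture URL.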
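This is an interesting problem, and I will add it to my Newspaper Usage Overview document, which is available on GitHub.
I tried using newspaper.build, but I could not get it to work, so I used newspaper Source instead.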
from time import sleep
from random import randint
from newspaper import Config
from newspaper import Source
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
wayback_cnbc = Source(url='https://web.archive.org/web/20180301012621/https://www.cnbc.com/', config=config,
                      memoize_articles=False, language='en', number_threads=20, thread_timeout_seconds=2)
wayback_cnbc.build()
for article_extract in wayback_cnbc.articles:
    article_extract.download()
    article_extract.parse()
    print(article_extract.publish_date)
    print(article_extract.title)
    print(article_extract.url)
    print('')
    # this sleep timer is helping with some timeout issues
    # that were happening when querying
    sleep(randint(1,3))
The example above produces the following output:
None
Media
https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/
None
CNBC Video
https://web.archive.org/web/20180301012621/https://www.cnbc.com/video/
2017-11-08 00:00:00
CNBC Healthy Returns
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2017/11/08/healthy-returns.html
2018-02-28 00:00:00
Markets in Asia decline as dollar steadies; Nikkei falls 307 points
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/asia-markets-stocks-dollar-and-china-caixin-pmi-in-focus.html
2018-02-28 00:00:00
S&P 500 rises, but on track to snap longest monthly win streak since 1959
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/us-stocks-interest-rates-fed-markets.html
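The output above mixes section pages (such as /media/ and /video/, which have no publish date) with dated articles. One possible way to keep only the articles is to filter on the URL before downloading. This is a rough sketch under two assumptions taken from the sample output, not a documented newspaper feature: article links share the capture's URL prefix, and article paths contain a /YYYY/MM/DD/ segment. The function name looks_like_article is hypothetical.

```python
import re

# Assumption: CNBC article paths contain a /YYYY/MM/DD/ segment, as in the
# sample output; section pages such as /media/ do not.
DATED_PATH = re.compile(r'/\d{4}/\d{2}/\d{2}/')


def looks_like_article(url, capture_prefix):
    """Return True if url belongs to the capture and its original path is dated."""
    if not url.startswith(capture_prefix):
        # Links outside the capture, e.g. blog.archive.org posts, are skipped.
        return False
    original_part = url[len(capture_prefix):]
    return DATED_PATH.search(original_part) is not None


capture_prefix = 'https://web.archive.org/web/20180301012621/https://www.cnbc.com'
urls = [
    'https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/',
    'https://web.archive.org/web/20180301012621/https://www.cnbc.com/2017/11/08/healthy-returns.html',
    'https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/',
]
articles = [u for u in urls if looks_like_article(u, capture_prefix)]
print(articles)
```

In the loop above, the same check could be applied to article_extract.url before calling download(), which would also avoid requests for pages that are not articles. The prefix check matters because the blog.archive.org link has a dated path too.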
I hope this answer helps with your use case of querying Wayback Machine articles. Please let me know if you have any questions.