Scraping cached pages

Question

I am using Scrapy to fetch some web content like this:

import scrapy


class PitchforkTracks(scrapy.Spider):
    name = "pitchfork_tracks"
    allowed_domains = ["pitchfork.com"]
    start_urls = [
        "http://pitchfork.com/reviews/best/tracks/?page=1",
        "http://pitchfork.com/reviews/best/tracks/?page=2",
        "http://pitchfork.com/reviews/best/tracks/?page=3",
    ]

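This works fine.

Now, instead of hitting the pages directly, I would like to scrape the Google caches of the same pages.

What is the correct syntax to achieve this?

PS: I have already tried "cache:http://pitchfork.com/reviews/best/tracks/?page=1" as a start URL, and it did not work.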

Answer 1

You can use the following Google URL to scrape the cached page:

http://webcache.googleusercontent.com/search?q=cache:http://pitchfork.com/reviews/best/tracks/?page=1
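To apply this to all three pages from the question, the cache prefix can be prepended to each URL before it goes into start_urls. The following is only a minimal sketch: google_cache_url is a hypothetical helper name, not part of Scrapy, and it assumes the webcache endpoint shown above is still served and accepts the target URL in this form.

```python
from urllib.parse import quote

# Endpoint taken from the answer above (assumed to still be available).
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"


def google_cache_url(url):
    """Return the Google cache URL for `url` (hypothetical helper)."""
    # Keep ':', '/', '?' and '=' readable; escape anything else (such as '&')
    # that would otherwise be parsed as part of the outer query string.
    return CACHE_PREFIX + quote(url, safe=":/?=")


# Same three pages as in the question.
pages = [
    "http://pitchfork.com/reviews/best/tracks/?page=%d" % n for n in range(1, 4)
]
start_urls = [google_cache_url(page) for page in pages]

for cached in start_urls:
    print(cached)
```

The resulting list could then be assigned to the spider's start_urls attribute. If the spider also follows links on the cache host, allowed_domains would probably need to include webcache.googleusercontent.com as well.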

python

browser-cache

scrapy