Storing responses as files using Scrapy Splash

Question

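I am creating my first Scrapy project with Splash, using the test data from http://quotes.toscrape.com/js/. I want to store the quotes from each page as a separate file on disk (in the code below, I first try to store the whole page). The code worked when I did not use SplashRequest, but with the new code below, nothing is stored on disk when I 'Run and debug' it in Visual Studio Code. Also, self.log does not write anything to my Visual Studio Code terminal window. I am new to Splash, so I am sure I am missing something, but what?

I have already checked here and here.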

import scrapy
from scrapy_splash import SplashRequest

class QuoteItem(scrapy.Item):
    author = scrapy.Field()
    quote = scrapy.Field()   

class MySpider(scrapy.Spider):
    name = "jsscraper"

    
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        for q in response.css("div.quote"):            
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

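        # Cycle through all available pages. The selector yields <a> elements,
        # so take each href and make it absolute before requesting it.
        for a in response.css('ul.pager a'):
            next_url = response.urljoin(a.attrib['href'])
            yield SplashRequest(url=next_url, callback=self.parse, endpoint='render.html', args={'wait': 0.5})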

       
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Update 1

How I debug it:

In Visual Studio Code, press F5
Select 'Python file'

The Output tab is empty.

The Terminal tab contains:

PS C:\scrapy\tutorial>  cd 'c:\scrapy\tutorial'; & 'C:\Users\Mark\AppData\Local\Programs\Python\Python38-32\python.exe' 'c:\Users\Mark\.vscode\extensions\ms-python.python-2020.9.114305\pythonFiles\lib\python\debugpy\launcher' '58582' '--' 'c:\scrapy\tutorial\spiders\quotes_spider_js.py'
PS C:\scrapy\tutorial>

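Also, nothing is logged in my Docker container instance, which I believe Splash needs in order to work.

Update 2

I ran scrapy crawl jsscraper and a file 'quotes-js.html' was stored on disk. However, it contains the page source HTML without any of the JavaScript executed. I want the JS code on 'http://quotes.toscrape.com/js/' to be executed and only the quote content to be stored. How can I do that?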

Answer 1

Problem

The JavaScript on the website you want to scrape is not being executed.

Solution

Increase the SplashRequest wait time to give the JavaScript time to execute.

Example

yield SplashRequest(
    url=url,
    callback=self.parse,
    endpoint='render.html',
    args={ 'wait': 0.5 }
)
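
If the saved file still contains the unrendered page source after adding a wait, it is also worth checking that the request is actually going through Splash. The sketch below shows the project settings that the scrapy-splash README asks for. The SPLASH_URL value is an assumption (a Splash container published on localhost port 8050); the question does not state where the container is listening, so adjust it to your setup.

```python
# settings.py (sketch): enable scrapy-splash for the project.

# Assumed address of the Splash Docker container; change it to match yours.
SPLASH_URL = "http://localhost:8050"

# Route SplashRequest objects through Splash instead of fetching them directly.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Saves space when Splash arguments are repeated across requests.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With these settings in place, running scrapy crawl jsscraper should produce log lines in the Splash container for each rendered page. An empty container log suggests the requests are not reaching Splash.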

Answer 2

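Writing the output to JSON files:

I have tried to solve your problem. Here is a working version of your code. I hope this is what you are trying to achieve.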

import json

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["http://quotes.toscrape.com/js/page/"+str(i+1) for i in range(10)]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={'wait': 0.5}
            )

    def parse(self, response):
        quotes = {"quotes": []}
        for q in response.css("div.quote"):
            quote = dict()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            quotes["quotes"].append(quote)

        page = response.url[response.url.index("page/")+5:]
        print("page=", page)
        filename = 'quotes-%s.json' % page
        with open(filename, 'w') as outfile:
            outfile.write(json.dumps(quotes, indent=4, separators=(',', ":")))

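Update: The code above has been updated to scrape all pages and save the results in separate JSON files, one for each of pages 1 through 10.

This writes each page's list of quotes to a separate JSON file, as follows: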

{
    "quotes":[
        {
            "author":"Albert Einstein",
            "quote":"\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
        },
        {
            "author":"J.K. Rowling",
            "quote":"\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
        },
        {
            "author":"Jane Austen",
            "quote":"\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
        },
        {
            "author":"Marilyn Monroe",
            "quote":"\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
        },
        {
            "author":"Albert Einstein",
            "quote":"\u201cTry not to become a man of success. Rather become a man of value.\u201d"
        },
        {
            "author":"Andr\u00e9 Gide",
            "quote":"\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
        },
        {
            "author":"Thomas A. Edison",
            "quote":"\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
        },
        {
            "author":"Eleanor Roosevelt",
            "quote":"\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
        },
        {
            "author":"Steve Martin",
            "quote":"\u201cA day without sunshine is like, you know, night.\u201d"
        }
    ]
}
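
The file naming and JSON writing steps can also be pulled out into small helpers, which makes them easy to check without running a crawl. This is only a sketch: page_filename and dump_quotes are hypothetical names, not part of Scrapy or scrapy-splash. It takes the last non-empty path segment of the URL as the page label, so a trailing slash does not end up in the filename, and it passes ensure_ascii=False so the curly quotes are written as-is instead of as \u201c escapes.

```python
import json
from urllib.parse import urlparse


def page_filename(url, extension="json"):
    """Build a filename such as 'quotes-3.json' from a page URL (hypothetical helper)."""
    # Keep only non-empty path segments so a trailing slash is ignored.
    segments = [s for s in urlparse(url).path.split("/") if s]
    label = segments[-1] if segments else "index"
    return "quotes-%s.%s" % (label, extension)


def dump_quotes(quotes):
    """Serialize a list of quote dicts, keeping non-ASCII characters readable."""
    return json.dumps({"quotes": quotes}, indent=4, ensure_ascii=False)


if __name__ == "__main__":
    sample = [{"author": "Steve Martin", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"}]
    print(page_filename("http://quotes.toscrape.com/js/page/10/"))
    print(dump_quotes(sample))
```

Inside parse, these would be used as filename = page_filename(response.url), followed by writing dump_quotes(...) to a file opened with open(filename, 'w', encoding='utf-8') so the non-ASCII characters are stored correctly on Windows as well.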
