Scrapy Python - How to Pass URL and retrieve URL for Scraping

Question

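I have very little programming experience with Python and much more with Java.

I am trying to get into Python, but I am having trouble understanding a Scrapy web crawler I am trying to set up.

The script should scrape products and similar items from a website and write them to files, then recursively go through all the landing pages within the site, stopping at a specified depth.

I cannot work out how to pass the URL given in my launching script to the Scrapy examples I have found.

The code that runs my spider: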


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(UrlScrappyRunner, domain="www.google.com")
process.start()

My spider:

class UrlScrappyRunner(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Could you tell me how to pass domain=www.google.com to my spider so that it crawls google instead of quotes.toscrape.com?

Answer 1

You can use the -a option in Scrapy to pass user-defined values to a spider:

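import scrapy

class UrlScrappyRunner(scrapy.Spider):
    name = "quotes"

    def __init__(self, domain=None, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before storing the argument.
        super().__init__(*args, **kwargs)
        self.domain = domain

    def start_requests(self):
        # Assumption: domain is a bare host name, so a scheme is prepended.
        yield scrapy.Request(url="http://%s/" % self.domain, callback=self.parse)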

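To run it with the argument from the command line (scrapy crawl takes the spider's name attribute, here "quotes", not the class name):

scrapy crawl quotes -a domain="www.google.com"

To run it from a process: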

process.crawl(UrlScrappyRunner, domain="www.google.com")

Add an __init__ method to your code and assign the domain value to an attribute of the spider.
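One detail worth checking when passing a bare domain such as "www.google.com": scrapy.Request rejects URLs that have no scheme. The following is a minimal sketch of how the argument could be normalized before it is requested. The helper names to_start_url and to_allowed_domain are hypothetical and not part of Scrapy, and defaulting to "http" is an assumption.

```python
from urllib.parse import urlparse


def to_start_url(domain, scheme="http"):
    """Turn a bare domain into a requestable URL (hypothetical helper)."""
    # Values that already carry a scheme are returned unchanged.
    if "://" in domain:
        return domain
    # Assumption: plain http is acceptable when no scheme was given.
    return "%s://%s/" % (scheme, domain.strip("/"))


def to_allowed_domain(url):
    """Extract the host name, e.g. for a spider's allowed_domains list."""
    return urlparse(url).hostname
```

Inside the spider, start_requests could then yield scrapy.Request(to_start_url(self.domain), callback=self.parse).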

Answer 2

This is how to pass URLs as an argument to a spider run through CrawlerProcess, on Scrapy 1.4.0:

main.py

from scrapy.crawler import CrawlerProcess
from myspider import MySpider

if __name__ == '__main__':
  process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_URI': destination_file_uri  # URI to destination file
  })
  # Keyword arguments given to crawl() are forwarded to the spider's __init__.
  process.crawl(MySpider, myurls=[
    'http://example.com'
  ])
  process.start()  # blocks until the crawl has finished

where destination_file_uri is something like "file:///path/to/results.json".
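The FEED_URI setting above matches Scrapy 1.4.0. In Scrapy 2.1 and later, the FEEDS setting takes over this role, so on a newer version the settings dictionary might look like the sketch below. The path is the same placeholder as above, and the "json" format is an assumption based on the file extension.

```python
# Placeholder destination; any supported storage URI or local path can be used.
destination_file_uri = "file:///path/to/results.json"

# Scrapy 2.1+ style: FEEDS maps each destination to its export options.
settings = {
    "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "FEEDS": {destination_file_uri: {"format": "json"}},
}
```

This dictionary would be passed to CrawlerProcess in place of the one shown in main.py.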

myspider.py

import scrapy
from scrapy.http.request import Request

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, *args, **kwargs):
        self.myurls = kwargs.get('myurls', [])
        super(MySpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.myurls:
            yield Request(url, self.parse)

    def parse(self, response):
        """ Test: extract quotes """
        for quote in response.xpath('//blockquote').extract():
            yield {"quote": quote}

where myurls is an attribute name that is not already in use (you can change it).
