Pass the url into the parse method in scrapy that was consumed from RabbitMQ

I am using scrapy to consume messages (urls) from RabbitMQ, but when I yield a request from the callback, passing the consumed url as an argument, the program never enters the parse callback. Below is the code for my spider:

# -*- coding: utf-8 -*-
import scrapy
import pika
from scrapy import cmdline
import json

class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'
    allowed_domains = []
    start_urls = []

    def callback(self, ch, method, properties, body):
        print(" [x] Received %r" % body)
        body = json.loads(body)
        url = body.get('url')
        yield scrapy.Request(url=url, callback=self.parse)

    def start_requests(self):
        cre = pika.PlainCredentials('test', 'test')
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='10.0.12.103', port=5672, credentials=cre, socket_timeout=60))
        channel = connection.channel()

        channel.basic_consume(self.callback,
                              queue='Deletespider_Batch_Test',
                              no_ack=True)

        print(' [*] Waiting for messages. To exit press CTRL+C')
        channel.start_consuming()

    def parse(self, response):
        print(response.url)

cmdline.execute('scrapy crawl Mydeletespider'.split())

My goal is to pass the response for each url to the parse method.

To consume urls from rabbitmq you can look at the scrapy-rabbitmq package:

Scrapy-rabbitmq is a tool that lets you feed and queue URLs from RabbitMQ via Scrapy spiders, using the Scrapy framework.

To enable it, set these values in your settings.py:
# Enables scheduling storing requests queue in rabbitmq.
SCHEDULER = "scrapy_rabbitmq.scheduler.Scheduler"
# Don't cleanup rabbitmq queues, allows to pause/resume crawls.
SCHEDULER_PERSIST = True
# Schedule requests using a priority queue. (default)
SCHEDULER_QUEUE_CLASS = 'scrapy_rabbitmq.queue.SpiderQueue'
# RabbitMQ Queue to use to store requests
RABBITMQ_QUEUE_NAME = 'scrapy_queue'
# Provide host and port to RabbitMQ daemon
RABBITMQ_CONNECTION_PARAMETERS = {'host': 'localhost', 'port': 6666}

# Bonus:
# Store scraped item in rabbitmq for post-processing.
# ITEM_PIPELINES = {
#    'scrapy_rabbitmq.pipelines.RabbitMQPipeline': 1
# }

And in your spider:

from scrapy import Spider
from scrapy_rabbitmq.spiders import RabbitMQMixin

class RabbitSpider(RabbitMQMixin, Spider):
    name = 'rabbitspider'

    def parse(self, response):
        # mixin will take urls from rabbit queue by itself
        pass
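
URLs still have to be fed into the queue named by RABBITMQ_QUEUE_NAME for the scheduler to pick them up. Here is a minimal producer sketch (not from the original answer), assuming the queue expects plain URL strings and using the broker host/port from the settings.py example above:

#!/usr/bin/env python
# Hypothetical producer: pushes start URLs into the queue that the
# scrapy-rabbitmq scheduler reads from. Assumes plain URL strings are
# what the queue expects.
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost', port=6666))
channel = connection.channel()
channel.queue_declare(queue='scrapy_queue')

for url in ['http://example.com/page1', 'http://example.com/page2']:
    channel.basic_publish(exchange='', routing_key='scrapy_queue', body=url)

connection.close()

Then run the spider as usual (scrapy crawl rabbitspider) and the mixin pulls those URLs from the queue by itself.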

Refer to this: http://30daydo.com/article/512

The def start_requests(self) function should return a generator, otherwise scrapy will not work. Yielding inside the pika callback does not help, because pika ignores the callback's return value, so those requests never reach Scrapy's engine.
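
As a rough sketch of that idea applied to the question's spider (assuming, as in the question, that each message is a JSON body with a 'url' field), start_requests can drain the queue itself with basic_get and yield the requests, so Scrapy's engine does the scheduling:

# -*- coding: utf-8 -*-
# Minimal sketch (untested against the asker's broker): start_requests is a
# generator that pulls messages with basic_get and yields scrapy.Request
# objects instead of handing control to pika's start_consuming().
import json

import pika
import scrapy


class MydeletespiderSpider(scrapy.Spider):
    name = 'Mydeletespider'

    def start_requests(self):
        cre = pika.PlainCredentials('test', 'test')
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='10.0.12.103', port=5672,
                                      credentials=cre, socket_timeout=60))
        channel = connection.channel()
        while True:
            # basic_get pulls one message at a time; no_ack matches the pika
            # version used in the question (newer pika renames it auto_ack).
            method_frame, header_frame, body = channel.basic_get(
                queue='Deletespider_Batch_Test', no_ack=True)
            if method_frame is None:
                break  # queue drained, stop yielding requests
            url = json.loads(body).get('url')
            if url:
                yield scrapy.Request(url=url, callback=self.parse)
        connection.close()

    def parse(self, response):
        print(response.url)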