Scrapy Spider following urls, but won't export the data

Question

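I'm trying to get details from a real-estate listing page. I can scrape all the data, but I can't seem to export it.

Perhaps it's the way I use the yield keyword. Most of the code works:

Visits page 1, example.com/kittens
Goes to page 2, example.com/puppers. Here 10 apartments are listed in blocks. I can get data from each block, but I need additional info from inside the hyperlink.

Visits the hyperlink, e.g. example.com/puppers/apartment1. It fetches some info from here as well, but I can't seem to return this data and include it in my HousingItem() class.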

import scrapy
from urllib.parse import urljoin

class HousingItem(scrapy.Item):
     street      = scrapy.Field()
     postal      = scrapy.Field()
     city        = scrapy.Field()
     url         = scrapy.Field()

     buildY         = scrapy.Field()
     on_m           = scrapy.Field()
     off_m          = scrapy.Field()


class FAppSpider(scrapy.Spider):
    name = 'f_app'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com/kittens']

    def parse(self, response):

         yield scrapy.Request(url="https://www.example.com/puppers",
             callback=self.parse_puppers)   

    def parse_inside_pupper(self, response):

         item = HousingItem()
         item['buildY']          = response.xpath('').extract_first().strip()
         item['on_m']            = response.xpath('').extract_first().strip()
         item['off_m']           = response.xpath('').extract_first().strip()


    def parse_puppers(self, response):

         base_url = 'https://www.example.com/'
         for block in response.css('div.search-result-main'):

              item = HousingItem()
              item['street']          = block.css('')
              item['postal']          = block.css('')  # key must match the 'postal' field declared on HousingItem
              item['city']            = block.css('')
              item['url']             = urljoin(base_url, block.css('div.search-result-header > a::attr(href)')[0].extract())

            # Problem area from here.. 

              yield response.follow(url=item['url'],callback=self.parse_inside_pupper)

            # yield scrapy.request(url=item['url'],callback=self.parse_inside_pupper)?

              yield item

FEED_EXPORT_FIELDS is adjusted in my settings.py. The 4 fields set in parse_puppers() get exported correctly. The data from parse_inside_pupper() shows up correctly in the console, but it won't export.
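For reference, the setting mentioned above would look roughly like the sketch below. The actual list in the project is not shown in the question, so the field names here are an assumption taken from the HousingItem definition.

```python
# settings.py (sketch): fields to write to the feed, and their column order.
# Assumption: the names mirror the fields declared on HousingItem.
FEED_EXPORT_FIELDS = [
    "street",
    "postal",
    "city",
    "url",
    "buildY",
    "on_m",
    "off_m",
]
```

A column listed here stays empty in the CSV if the exported item never had that field set, which matches the symptom described.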

I use scrapy crawl f_app -o raw_data.csv to run my spider. Thanks in advance, all help is appreciated.

P.S. I'm pretty new to Python and still practicing, as I bet you noticed.

Answer 1

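You need to send the current item to parse_inside_pupper using the meta argument. In the original code the item was yielded from parse_puppers right away, before the detail page was parsed, and parse_inside_pupper built a separate item that it never yielded: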

def parse_puppers(self, response):

     base_url = 'https://www.example.com/'
     for block in response.css('div.search-result-main'):

          item = HousingItem()
          item['street']          = block.css('')
          item['postal']          = block.css('')  # key must match the 'postal' field declared on HousingItem
          item['city']            = block.css('')
          item['url']             = urljoin(base_url, block.css('div.search-result-header > a::attr(href)')[0].extract())

          yield response.follow(url=item['url'],callback=self.parse_inside_pupper, meta={"item": item})

After that you can pick it up in parse_inside_pupper, fill in the remaining fields, and yield the item from there:

def parse_inside_pupper(self, response):

     item = response.meta["item"]
     item['buildY']          = response.xpath('').extract_first().strip()
     item['on_m']            = response.xpath('').extract_first().strip()
     item['off_m']           = response.xpath('').extract_first().strip()
     yield item
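
To see why the original version exported only four fields, here is a toy model of how a crawler consumes what a callback yields. It is only a sketch: FakeRequest, FakeResponse and crawl() are hypothetical stand-ins written for this illustration, not Scrapy classes, and the item values are made up. It does not need Scrapy installed.

```python
from collections import deque


class FakeRequest:
    """Stand-in for a request: a URL, a callback, and a meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Stand-in for a response; it carries the meta of its request."""

    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta


def crawl(start_request):
    """Toy engine: yielded requests are queued, anything else is exported."""
    queue = deque([start_request])
    exported = []
    while queue:
        request = queue.popleft()
        # A callback without any yield returns None, so fall back to ().
        for result in request.callback(FakeResponse(request)) or ():
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                exported.append(result)
    return exported


def parse_detail_broken(response):
    # Builds a fresh item and never yields it, as in the question.
    item = {"buildY": "1990"}


def parse_list_broken(response):
    item = {"street": "Main St", "url": response.url + "/apartment1"}
    yield FakeRequest(item["url"], parse_detail_broken)
    yield item  # exported right away, before the detail page is parsed


def parse_detail_fixed(response):
    item = response.meta["item"]  # the partially filled item from the list page
    item["buildY"] = "1990"
    yield item  # the only place where the item is yielded


def parse_list_fixed(response):
    item = {"street": "Main St", "url": response.url + "/apartment1"}
    yield FakeRequest(item["url"], parse_detail_fixed, meta={"item": item})


start_url = "https://www.example.com/puppers"
broken = crawl(FakeRequest(start_url, parse_list_broken))
fixed = crawl(FakeRequest(start_url, parse_list_fixed))
print(broken)
print(fixed)
```

In the broken variant the exported item has no buildY key. In the fixed variant the same item object travels with the request and is exported once, with all fields set.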
