Scraping content of a tag by title with Scrapy

Question

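I am scraping listings from a real estate website. The property details are inside a table, and they all share the same class name.

However, the values are sometimes ordered differently or missing, so when I run my spider I get values in the wrong columns.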

            type = response.css('div.carac-value span::text').extract()[1]
            year = response.css('div.carac-value span::text').extract()[2]
            area = response.css('div.carac-value span::text').extract()[3]

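(For example, the column for the property's area ends up containing its construction year.) How can I extract only the content of the element with this class that has a specific title, such as "Superficie nette"?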

Answer 1

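I use default='' because not every page has all of these attributes (year, type, area).
I use XPath to find the specific title div whose text contains the given word, then take the text of its following sibling div.
I renamed type to type1 (type is a Python built-in).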

scrapy shell

In [1]: url = 'https://www.centris.ca/fr/condo~a-vendre~montreal-rosemont-la-petite-patrie/19783085?view=Summary&uc=3'

In [2]: headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

In [3]: req = scrapy.Request(url=url, headers=headers)

In [4]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centris.ca/fr/condo~a-vendre~montreal-rosemont-la-petite-patrie/19783085?view=Summary&uc=3> (referer: None)

In [5]: type1 = response.xpath('//div[@class="carac-title"][contains(text(), "Type")]/following-sibling::div[@class="carac-value"]//text()').get(default='')

In [6]: year = response.xpath('//div[@class="carac-title"][contains(text(), "Année")]/following-sibling::div[@class="carac-value"]//text()').get(default='')

In [7]: area = response.xpath('//div[@class="carac-title"][contains(text(), "Superficie")]/following-sibling::div[@class="carac-value"]//text()').get(default='')

In [8]: type1
Out[8]: 'Divise'

In [9]: year
Out[9]: '2015'

In [10]: area
Out[10]: ''

# example with a page that has an area value
In [11]: url = 'https://www.centris.ca/fr/maison~a-vendre~laval-sainte-rose/19672961?view=Summary&uc=2'

In [12]: req = scrapy.Request(url=url, headers=headers)

In [13]: fetch(req)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centris.ca/fr/maison~a-vendre~laval-sainte-rose/19672961?view=Summary&uc=2> (referer: None)

In [14]: area = response.xpath('//div[@class="carac-title"][contains(text(), "Superficie")]/following-sibling::div[@class="carac-value"]//text()').get(default='')

In [15]: area
Out[15]: '7 500 pc'
