Scraping content of a tag by title with Scrapy
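I am scraping listings on a real-estate website.
The property details are in a table, and all of the cells share the same class name.
However, the values are sometimes ordered differently or missing, so when I run my spider I get values in the wrong columns: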
type = response.css('div.carac-value span::text').extract()[1]
year = response.css('div.carac-value span::text').extract()[2]
area = response.css('div.carac-value span::text').extract()[3]
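(i.e. in the column for the property's area I get its year of construction instead)
How can I extract only the content of the element with this class that has a specific title, such as "Superficie nette"?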
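- I use default='' because not all pages have these attributes (year, type, area).
- I use XPath to find the specific div whose text contains the given word, and then take the text of its following sibling.
- I renamed type to type1 so that it does not shadow Python's built-in type.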
scrapy shell
In [1]: url = 'https://www.centris.ca/fr/condo~a-vendre~montreal-rosemont-la-petite-patrie/19783085?view=Summary&uc=3'
In [2]: headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
In [3]: req = scrapy.Request(url=url, headers=headers)
In [4]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centris.ca/fr/condo~a-vendre~montreal-rosemont-la-petite-patrie/19783085?view=Summary&uc=3> (referer: None)
In [5]: type1 = response.xpath('//div[@class="carac-title"][contains(text(), "Type")]/following-sibling::div[@class="carac-value"]//text()').get(default='')
In [6]: year = response.xpath('//div[@class="carac-title"][contains(text(), "Année")]/following-sibling::div[@class="carac-value"]//text()').get(default='')
In [7]: area = response.xpath('//div[@class="carac-title"][contains(text(), "Superficie")]/following-sibling::div[@class="carac-value"]//text()').get(default='')
In [8]: type1
Out[8]: 'Divise'
In [9]: year
Out[9]: '2015'
In [10]: area
Out[10]: ''
# example with a page that has an area value
In [11]: url = 'https://www.centris.ca/fr/maison~a-vendre~laval-sainte-rose/19672961?view=Summary&uc=2'
In [12]: req = scrapy.Request(url=url, headers=headers)
In [13]: fetch(req)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centris.ca/fr/maison~a-vendre~laval-sainte-rose/19672961?view=Summary&uc=2> (referer: None)
In [14]: area = response.xpath('//div[@class="carac-title"][contains(text(), "Superficie")]/following-sibling::div[@class="carac-value"]//text()').get(default='')
In [15]: area
Out[15]: '7 500 pc'
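The three lookups in the session above differ only in the title keyword, so they could be folded into one helper that a spider callback calls. This is a minimal sketch, not a tested spider: the names FIELDS, VALUE_XPATH and extract_caracs are hypothetical, the XPath is the one from the session with the keyword left as a placeholder, and the .strip() call is an added assumption to trim surrounding whitespace.

```python
# Hypothetical mapping: output field name -> word expected in the
# row's "carac-title" div (the same three keywords used in the session).
FIELDS = {
    "type1": "Type",
    "year": "Année",
    "area": "Superficie",
}

# Same XPath as in the shell session, with the title keyword as a slot.
VALUE_XPATH = (
    '//div[@class="carac-title"][contains(text(), "{title}")]'
    '/following-sibling::div[@class="carac-value"]//text()'
)


def extract_caracs(response, fields=FIELDS):
    """Look up each field by its title; missing rows become ''."""
    item = {}
    for name, title in fields.items():
        query = VALUE_XPATH.format(title=title)
        # get(default='') returns the first matching text node, or ''
        # when the page has no row with that title.
        item[name] = response.xpath(query).get(default="").strip()
    return item
```

Inside a spider, the callback could then be as short as `def parse(self, response): yield extract_caracs(response)`. Because each value is found through its own title rather than by position, a missing or reordered row only affects its own field.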