Using Scrapy and XPath to parse data

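I have been trying to scrape some data, but I keep getting an empty value or None. I tried using following-sibling, but that failed (I probably did it wrong). Any help is greatly appreciated. Thanks in advance.

Website to scrape (eventually): https://www.unegui.mn/azhild-avna/ulan-bator/

Website to test on (currently, fewer listings): https://www.unegui.mn/azhild-avna/mt-hariltsaa-holboo/slzhee-tehnik-hangamzh/ulan-bator/

Code snippet: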

def parse(self, response, **kwargs):
    cards = response.xpath("//li[contains(@class,'announcement-container')]")
    # parse details
    for card in cards: 
    company = card.xpath(".//*[@class='announcement-block__company-name']/text()").extract_first()
    date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
    date = date_block[0]
    city = date_block[1]

    item = {'date': date,
           'city': city,
           'company': company
           }

HTML snippet:

<div class="announcement-block__date">
<span class="announcement-block__company-name">Электро экспресс ХХК</span>
,          Өчигдөр 13:05,                  Улаанбаатар</div>

Expected output:

date = Өчигдөр 13:05
city = Улаанбаатар

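Update: I figured out how to get the date and city data. I ended up using following-sibling to select the text after the span, splitting it on commas, and taking the second and third values.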

    date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/span/following-sibling::text())").extract_first().split(',')
    date = date_block[1]
    city = date_block[2]
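
To illustrate why the original `text()` expression came back empty while the following-sibling version works, here is a minimal sketch run against the HTML snippet from the question. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors (an assumption made only so the example runs without Scrapy), and `split_date_block` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# The HTML fragment from the question, parsed with the standard library
# as a stand-in for Scrapy's selectors.
HTML = (
    '<div class="announcement-block__date">\n'
    '<span class="announcement-block__company-name">Электро экспресс ХХК</span>\n'
    ',          Өчигдөр 13:05,                  Улаанбаатар</div>'
)

div = ET.fromstring(HTML)
span = div.find("span")

# Text before the <span>: this is the div's first text node, and it is
# only whitespace, which is why normalize-space(div/text()) is empty.
first_text = div.text
# Text after the <span>: the node that span/following-sibling::text() selects.
trailing_text = span.tail


def split_date_block(raw):
    """Hypothetical helper: return (date, city) from ', date, city' text."""
    parts = [part.strip() for part in raw.split(",")]
    parts = [part for part in parts if part]  # drop the empty leading piece
    date = parts[0] if len(parts) > 0 else None
    city = parts[1] if len(parts) > 1 else None
    return date, city


date, city = split_date_block(trailing_text)
print(repr(first_text.strip()))
print(date, "|", city)
```

Dropping the empty pieces before indexing means the helper returns `None` for a missing part instead of raising an `IndexError` on a card that lacks the date or the city.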

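Extra:

If anyone could tell me or recommend how to set up my pipelines file, I would appreciate it. Is using pipelines the right approach, or should I use items.py? I currently have 3 spiders in the same project folder: apartments, jobs, and cars. I need to clean and transform my data. For example, for the jobs spider I am currently working on, shown above, I want to create the following actions: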

My pipelines.py file:

from itemadapter import ItemAdapter


class ScrapebooksPipeline:
    def process_item(self, item, spider):
        return item

My items.py file:

import scrapy


class ScrapebooksItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
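
On the pipelines question: in Scrapy, items.py declares which fields an item has, while pipelines.py processes each item after the spider yields it, so cleaning and transformation usually belong in a pipeline. The following is only a sketch of what such a pipeline might look like for the dict items yielded above. The class name `JobsCleaningPipeline`, the spider name `jobs`, and the `FakeSpider` stand-in are all assumptions made for illustration.

```python
class JobsCleaningPipeline:
    """Hypothetical pipeline that tidies the fields yielded by the jobs spider."""

    # Assumed spider name; Scrapy passes the running spider to process_item.
    target_spider = "jobs"
    text_fields = ("date", "city", "company")

    def process_item(self, item, spider):
        if spider.name != self.target_spider:
            return item  # leave items from the other spiders untouched
        for field in self.text_fields:
            value = item.get(field)
            if isinstance(value, str):
                # Collapse runs of whitespace and trim both ends.
                item[field] = " ".join(value.split())
        return item


# Stand-in for a real spider so the sketch runs without Scrapy installed.
class FakeSpider:
    name = "jobs"


cleaned = JobsCleaningPipeline().process_item(
    {
        "date": "  Өчигдөр 13:05 ",
        "city": "\nУлаанбаатар",
        "company": "Электро  экспресс ХХК",
    },
    FakeSpider(),
)
print(cleaned)
```

A pipeline only runs once it is listed in the `ITEM_PIPELINES` setting in settings.py, for example `{"scrapebooks.pipelines.JobsCleaningPipeline": 300}`, where the `scrapebooks` module path is a guess based on the class names in the question. Checking `spider.name` is one way to let three spiders share a project while each gets its own cleaning rules.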

It looks like you are missing indentation. Instead of

def parse(self, response, **kwargs):
    cards = response.xpath("//li[contains(@class,'announcement-container')]")
    # parse details
    for card in cards:
    date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
    date = date_block[0]
    city = date_block[1]

try this:

def parse(self, response, **kwargs):
    cards = response.xpath("//li[contains(@class,'announcement-container')]")
    # parse details
    for card in cards:
        date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
        date = date_block[0]
        city = date_block[1]
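
  1. I changed your XPath to a narrower scope.
  2. extract_first() returns only the first match, so use getall() instead.
  3. To get the date I had to use a regular expression (most results have a time but no date, so it is perfectly fine if the date comes back blank).
  4. I cannot read the language, so I had to guess (somewhat) which part is the city, but even if I guessed wrong, you get the idea.
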
import scrapy
import re


class TempSpider(scrapy.Spider):
    name = 'temp_spider'
    allowed_domains = ['unegui.mn']
    start_urls = ['https://www.unegui.mn/azhild-avna/ulan-bator/']

    def parse(self, response, **kwargs):
        cards = response.xpath('//div[@class="announcement-block__date"]')

        # parse details
        for card in cards:
            company = card.xpath('.//span/text()').get()

            date_block = card.xpath('./text()').getall()

            date = date_block[1].strip()
            date = re.findall(r'(\d+-\d+-\d+)', date)
            if date:
                date = date[0]
            else:
                date = ''

            city = date_block[1].split(',')[2].strip()

            item = {'date': date,
                    'city': city,
                    'company': company
                    }
            yield item

Output:

[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.unegui.mn/azhild-avna/ulan-bator/>
{'date': '2021-11-07', 'city': 'Улаанбаатар', 'company': 'Arirang'}
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.unegui.mn/azhild-avna/ulan-bator/>
{'date': '2021-11-11', 'city': 'Улаанбаатар', 'company': 'Altangadas'}
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.unegui.mn/azhild-avna/ulan-bator/>
...
...
...
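
The date regex used in the spider above can be checked in isolation. This is a small sketch, not the answer's exact code: `extract_date` is a hypothetical helper, and the first sample string is an assumed shape for a listing that carries a full date (the second is the text from the question's HTML snippet).

```python
import re

# Same pattern as in the spider: digits-digits-digits, e.g. 2021-11-07.
DATE_RE = re.compile(r"\d+-\d+-\d+")


def extract_date(text):
    """Hypothetical helper: return the first date found, or '' if none."""
    match = DATE_RE.search(text)
    return match.group(0) if match else ""


# Assumed sample: a listing whose text includes a full date.
with_date = extract_date(", 2021-11-07 13:05, Улаанбаатар")
# Text from the question's snippet: a relative day and a time, no date.
without_date = extract_date(", Өчигдөр 13:05, Улаанбаатар")

print(repr(with_date), repr(without_date))
```

Listings that show only a relative day and a time yield an empty string, which matches the note above that a blank date is acceptable.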