Using Scrapy and XPath to parse data
I've been trying to scrape some data but keep getting a blank value or None. I tried using following-sibling but failed (I probably did it wrong). Any help is much appreciated. Thanks in advance.
Website to scrape (eventually): https://www.unegui.mn/azhild-avna/ulan-bator/
Website to test on (for now, fewer listings): https://www.unegui.mn/azhild-avna/mt-hariltsaa-holboo/slzhee-tehnik-hangamzh/ulan-bator/
Code snippet:

    def parse(self, response, **kwargs):
        cards = response.xpath("//li[contains(@class,'announcement-container')]")

        # parse details
        for card in cards:
            company = card.xpath(".//*[@class='announcement-block__company-name']/text()").extract_first()
            date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
            date = date_block[0]
            city = date_block[1]

            item = {'date': date,
                    'city': city,
                    'company': company
                    }
HTML snippet:

    <div class="announcement-block__date">
        <span class="announcement-block__company-name">Электро экспресс ХХК</span>
        , Өчигдөр 13:05, Улаанбаатар</div>
Expected output:

    date = Өчигдөр 13:05
    city = Улаанбаатар
Update: I figured out how to get the date and city data. I ended up using following-sibling to get the text after the span, splitting on the commas, and taking the second and third values.

    date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/span/following-sibling::text())").extract_first().split(',')
    date = date_block[1]
    city = date_block[2]
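As a plain-Python illustration of why indices 1 and 2 are the right ones: the text node that follows the span starts with a comma (see the HTML snippet above), so index 0 of the split is an empty string.

```python
# The text node returned by following-sibling::text() in the snippet above:
text = ", Өчигдөр 13:05, Улаанбаатар"

date_block = text.split(',')   # ['', ' Өчигдөр 13:05', ' Улаанбаатар']
date = date_block[1].strip()   # 'Өчигдөр 13:05'
city = date_block[2].strip()   # 'Улаанбаатар'
```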
Extra:
If anyone could show me, or point me toward, how to set up my pipeline file, it would be much appreciated. Is a pipeline the right place for this, or should I use items.py? I currently have 3 spiders in the same project folder: apartments, jobs, cars. I need to clean my data and transform it. For example, for the jobs spider I'm currently working on (shown above), I want the following rules:
- if salary < 1000, replace it with the string 'Negotiable'
- if the date contains the text "Өчигдөр", replace it with 'Yesterday' without removing the time
- if the employer contains the value "Хувь хян", change the company value to "Хувь хян"
My pipelines.py file:

    from itemadapter import ItemAdapter

    class ScrapebooksPipeline:
        def process_item(self, item, spider):
            return item
我的items.py文件:
import scrapy
class ScrapebooksItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
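A pipeline is the right place for this kind of cleaning; items.py only declares the fields an item carries. Below is a minimal sketch of the three rules listed above, written for plain dict items (which is what the spiders here yield). The key names 'salary', 'date', and 'company', the pipeline class name, and the spider name 'jobs' are assumptions for illustration, not taken from the original project:

```python
class JobsCleaningPipeline:
    """Sketch: applies the cleaning rules from the question to dict items.
    Key names ('salary', 'date', 'company') are assumed, not confirmed."""

    def process_item(self, item, spider):
        if spider.name != 'jobs':      # leave the apartments/cars spiders alone
            return item

        # rule 1: salaries under 1000 become 'Negotiable'
        salary = item.get('salary')
        if isinstance(salary, (int, float)) and salary < 1000:
            item['salary'] = 'Negotiable'

        # rule 2: translate the word but keep the time part
        date = item.get('date') or ''
        if 'Өчигдөр' in date:
            item['date'] = date.replace('Өчигдөр', 'Yesterday')

        # rule 3: normalize the company value
        company = item.get('company') or ''
        if 'Хувь хян' in company:
            item['company'] = 'Хувь хян'

        return item
```

For Scrapy to run it, the class would also need to be registered in ITEM_PIPELINES in settings.py; one pipeline can serve all three spiders if it branches on spider.name as above.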
You seem to be missing indentation.

Instead of:

    def parse(self, response, **kwargs):
        cards = response.xpath("//li[contains(@class,'announcement-container')]")
        # parse details
        for card in cards:
        date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
        date = date_block[0]
        city = date_block[1]

try this:

    def parse(self, response, **kwargs):
        cards = response.xpath("//li[contains(@class,'announcement-container')]")
        # parse details
        for card in cards:
            date_block = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first().split(',')
            date = date_block[0]
            city = date_block[1]
- I changed your XPath to a smaller scope.
- extract_first() only grabs the first instance, so use getall() instead.
- To get the date I had to use a regex (most results have a time but no date, so it's perfectly fine for the date to come back blank).
- I can't read the language, so I had to guess (somewhat) at the city; even if the guess is wrong, you'll get the idea.
    import scrapy
    import re

    class TempSpider(scrapy.Spider):
        name = 'temp_spider'
        allowed_domains = ['unegui.mn']
        start_urls = ['https://www.unegui.mn/azhild-avna/ulan-bator/']

        def parse(self, response, **kwargs):
            cards = response.xpath('//div[@class="announcement-block__date"]')

            # parse details
            for card in cards:
                company = card.xpath('.//span/text()').get()
                date_block = card.xpath('./text()').getall()
                date = date_block[1].strip()
                date = re.findall(r'(\d+-\d+-\d+)', date)
                if date:
                    date = date[0]
                else:
                    date = ''
                city = date_block[1].split(',')[2].strip()

                item = {'date': date,
                        'city': city,
                        'company': company
                        }
                yield item
Output:
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.unegui.mn/azhild-avna/ulan-bator/>
{'date': '2021-11-07', 'city': 'Улаанбаатар', 'company': 'Arirang'}
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.unegui.mn/azhild-avna/ulan-bator/>
{'date': '2021-11-11', 'city': 'Улаанбаатар', 'company': 'Altangadas'}
[scrapy.core.scraper] DEBUG: Scraped from <200 https://www.unegui.mn/azhild-avna/ulan-bator/>
...
...
...
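The regex step from the spider above can be checked in isolation: listings carry either a relative time ('Өчигдөр 13:05') or a full date ('2021-11-07'), and the pattern only matches the latter, so the date falls back to an empty string for time-only listings.

```python
import re

def extract_date(text):
    # Returns the first yyyy-mm-dd style token, or '' when the
    # listing only shows a time (same logic as the spider above).
    matches = re.findall(r'(\d+-\d+-\d+)', text)
    return matches[0] if matches else ''

print(extract_date(', 2021-11-07, Улаанбаатар'))    # 2021-11-07
print(extract_date(', Өчигдөр 13:05, Улаанбаатар')) # (empty string)
```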