Scrape data from a JSON response with Scrapy
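I am trying to scrape a website that stores the data I need as JSON. I use a spider to fetch the data from the URL. I am a beginner with Scrapy, and with JSON in particular, so I cannot make sense of the JSON response. I want to extract dogId and msgTimeOff. My trial-and-error attempts either raise a KeyError or return data I don't need. One block containing the required data looks like this: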
{"raceId":"1808334","position":"6","trap":"6","resultHandicap":"","name":"Turkey Blaze","dogSex":"D",
"dogDateOfBirth":"2017-05-01 05:00","dogSire":"SIDARIAN BLAZE","dogDam":"MISS PRECEDENT",
"msgTimeOff":"2021-01-05 13:49","status":"6","reservename":"","dogId":"532771","reserveDogId":"",
"comment":"wide, crowded run-up and first","withdrawreason":"Wide,CrdRnUp&1","calcRTimeS":29.82,
"dogColor":"bk","fract":"10\/1","trainer":"A Jenkins","favFlag":"","rpDistDesc":"1 1\/4",
"splitTime":"4.68","winnersTimeS":"29.48","raceStatus":"P","rStatusCde":"P","finalROutcomeId":"6",
"reserveYn":"","isNonRunner":"0","isReserved":"0","videoid":""}
It is contained in a list, and there are many such lists. I want to extract all of them.
The code I use to get the JSON response is:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class MySpider(scrapy.Spider):
    name = "timeline"

    def __init__(self, date='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.date = date
        self.start_urls = ['https://greyhoundbet.racingpost.com/results/blocks.sd?race_id=1808334&track_id=4&r_date=' + date + '&r_time=13%3A49&blocks=meetingHeader%2Cresults-meeting-pager%2Clist']

    # NOTE: Scrapy sends start_urls responses to `parse` by default,
    # so parse2 only runs if it is set as a request callback.
    def parse2(self, response):
        jsn_data = response.json()
        for datas in jsn_data['list']['forecasts']:
            print(datas)


if __name__ == '__main__':
    spider = 'timeline'
    date = '2021-01-05'
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    process = CrawlerProcess(settings)
    process.crawl(spider, date=date)
    process.start()
First, get the raceIds.
Each raceId has a list of items, and each item has a dogId.
Like this:
Here is a more readable view of the JSON you get.
The JSON object has several keys and values, including dogId (line 14) and msgTimeOff (line 11). You can work with it the same way you would with a Python dictionary.
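To make the dictionary analogy concrete, here is a minimal sketch that parses a trimmed copy of the record from the question (only its first 13 keys, kept in the original order) and pretty-prints it with one key per line. The variable names and the key `noSuchKey` are made up for illustration. With this layout, msgTimeOff lands on line 11 and dogId on line 14 of the printed output, which appears to match the line numbers cited above.

```python
import json

# Trimmed copy of the record from the question (first 13 keys only).
raw = (
    '{"raceId":"1808334","position":"6","trap":"6","resultHandicap":"",'
    '"name":"Turkey Blaze","dogSex":"D","dogDateOfBirth":"2017-05-01 05:00",'
    '"dogSire":"SIDARIAN BLAZE","dogDam":"MISS PRECEDENT",'
    '"msgTimeOff":"2021-01-05 13:49","status":"6","reservename":"",'
    '"dogId":"532771"}'
)

record = json.loads(raw)               # JSON object -> plain Python dict
pretty = json.dumps(record, indent=4)  # one key per line, easier to read
print(pretty)

dog_id = record["dogId"]               # raises KeyError if the key is absent
time_off = record.get("msgTimeOff")    # returns None if the key is absent
missing = record.get("noSuchKey")      # hypothetical key, not in the data

print(dog_id, time_off, missing)
```

Square-bracket access is the usual source of the KeyError mentioned in the question; `.get()` avoids it when a key may be missing at some level of the response.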
So, in the parse2 method:
def parse2(self, response):
    jsn_data = response.json()
    for datas in jsn_data['list']['forecasts']:
        print(datas)
        dogId = datas.get('dogId')  # returns None if the key is not found
        msgTimeOff = datas.get('msgTimeOff')
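If it is unclear at which nesting level the records live, one option is to walk the whole parsed response and collect every dictionary that carries both keys. The sketch below is only an illustration under stated assumptions: `iter_records` is a hypothetical helper, and the shape of `sample` (including the `track` and `results` keys and the placeholder second record) is invented, not the real layout of the site's response.

```python
def iter_records(node, required_keys=("dogId", "msgTimeOff")):
    """Recursively yield every dict inside `node` that has all required keys."""
    if isinstance(node, dict):
        if all(key in node for key in required_keys):
            yield node
        # Keep descending: matching records may be nested at any depth.
        for value in node.values():
            yield from iter_records(value, required_keys)
    elif isinstance(node, list):
        for item in node:
            yield from iter_records(item, required_keys)


# Invented payload shape, for illustration only.
sample = {
    "list": {
        "track": {
            "results": {
                "1808334": [
                    {"dogId": "532771", "msgTimeOff": "2021-01-05 13:49"},
                    {"dogId": "PLACEHOLDER", "msgTimeOff": "2021-01-05 13:49"},
                    {"name": "record without the wanted keys"},
                ]
            }
        }
    }
}

rows = [(rec["dogId"], rec["msgTimeOff"]) for rec in iter_records(sample)]
print(rows)
```

Inside a spider callback, the same helper could be fed `response.json()` and each match yielded as an item, for example `yield {"dogId": rec["dogId"], "msgTimeOff": rec["msgTimeOff"]}`. Once the real structure is known, direct indexing is simpler and cheaper than a full walk.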