Yield Request call produces weird results in a recursive method with Scrapy
I am trying to scrape all the departures and arrivals of a single day from all the airports of all countries, using Python and Scrapy.
The JSON database used by this famous site (Flightradar24) has to be queried page by page whenever an airport has more than 100 departures or arrivals. I also compute a timestamp based on the actual UTC date of the query.
I am trying to build a database with this hierarchy:
country 1
    - airport 1
        - departures
            - page 1
            - page ...
        - arrivals
            - page 1
            - page ...
    - airport 2
        - departures
            - page 1
            - page ...
        - arrivals
            - page 1
            - page ...
...
I use two methods to compute the timestamp and to build the per-page query URL:
def compute_timestamp(self):
    from datetime import datetime, date
    import calendar
    # +/- 24 hours
    d = date(2017, 4, 27)
    timestamp = calendar.timegm(d.timetuple())
    return timestamp

def build_api_call(self, code, page, timestamp):
    return 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]={timestamp}&page={page}&limit=100&token='.format(
        code=code, page=page, timestamp=timestamp)
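As a quick sanity check, the two helpers behave like this (the values below are worked out by hand from the code above, not captured from a run):

# Assuming both methods live on the spider class:
# >>> self.compute_timestamp()
# 1493251200    # i.e. 2017-04-27 00:00:00 UTC
# >>> self.build_api_call('TLV', 1, 1493251200)
# 'https://api.flightradar24.com/common/v1/airport.json?code=TLV&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]=1493251200&page=1&limit=100&token='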
I store the results in a CountryItem, which contains many AirportItem for the airports. My item.py is:
class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    num_airports = scrapy.Field()
    airports = scrapy.Field()
    other_url = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    departures = scrapy.Field()
    arrivals = scrapy.Field()
My main parse builds a CountryItem for every country (limited to Israel here, as an example). Next, for each country I yield a scrapy.Request to scrape the airports.
###################################
# MAIN PARSE
###################################
def parse(self, response):
    count_country = 0
    countries = []
    for country in response.xpath('//a[@data-country]'):
        item = CountryItem()
        url = country.xpath('./@href').extract()
        name = country.xpath('./@title').extract()
        item['link'] = url[0]
        item['name'] = name[0]
        item['airports'] = []
        count_country += 1
        if name[0] == "Israel":
            countries.append(item)
            self.logger.info("Country name : %s with link %s", item['name'], item['link'])
            yield scrapy.Request(url[0], meta={'my_country_item': item}, callback=self.parse_airports)
This method scrapes the information for each airport, and for each airport it issues a scrapy.Request with the airport's URL to scrape departures and arrivals:
###################################
# PARSE EACH AIRPORT
###################################
def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()
        iAirport = AirportItem()
        iAirport['name'] = self.clean_html(name)
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = self.build_api_call(airport['code_little'], 1, self.compute_timestamp())
        urls.append(json_url)
    if not urls:
        return item

    # start with first url
    next_url = urls.pop()
    return scrapy.Request(next_url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': 0})
With the recursive method parse_schedule I add each airport to the country item. An SO member already helped me to get to this point.
###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(self, response):
    """we want to loop this continuously to build every departure and arrivals requests"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']

    urls_departures, urls_arrivals = self.compute_urls_by_page(response, item['airports'][i]['name'], item['airports'][i]['code_little'])
    print("urls_departures = ", len(urls_departures))
    print("urls_arrivals = ", len(urls_arrivals))

    ## YIELD NOT CALLED
    yield scrapy.Request(response.url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': urls_departures, 'i': 0, 'p': 0}, dont_filter=True)

    # now do next schedule items
    if not urls:
        yield item
        return
    url = urls.pop()
    yield scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
The self.compute_urls_by_page method computes the correct URLs to retrieve all the departures and arrivals of a single airport.
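That method is not reproduced in the question; below is a hypothetical sketch of what it could compute, assuming the JSON layout implied by the jmespath expression used elsewhere in this question (the real implementation is in the linked repository):

def compute_urls_by_page(self, response, name, code_little):
    # Hypothetical sketch, not the project's actual code: build one API URL
    # per page, using the page totals reported in the JSON response.
    data = json.loads(response.body_as_unicode())
    schedule = data['result']['response']['airport']['pluginData']['schedule']
    timestamp = self.compute_timestamp()
    urls_departures = [self.build_api_call(code_little, page, timestamp)
                       for page in range(1, schedule['departures']['page']['total'] + 1)]
    urls_arrivals = [self.build_api_call(code_little, page, timestamp)
                     for page in range(1, schedule['arrivals']['page']['total'] + 1)]
    return urls_departures, urls_arrivals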
###################################
# PARSE EACH DEPARTURES / ARRIVALS
###################################
def parse_departures_page(self, response):
    item = response.meta['airport_item']
    p = response.meta['p']
    i = response.meta['i']
    page_urls = response.meta['page_urls']

    print("PAGE URL = ", page_urls)

    if not page_urls:
        yield item
        return
    page_url = page_urls.pop()

    print("GET PAGE FOR ", item['airports'][i]['name'], ">> ", p)

    jsonload = json.loads(response.body_as_unicode())
    json_expression = jmespath.compile("result.response.airport.pluginData.schedule.departures.data")
    item['airports'][i]['departures'] = json_expression.search(jsonload)

    yield scrapy.Request(page_url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': page_urls, 'i': i, 'p': p + 1})
Next, the first yield in the recursive parse_schedule method, which normally calls self.parse_departures_page, produces weird results. Scrapy does call the method, but I only collect the departures pages of one airport, and I don't understand why... There is probably an ordering error somewhere in my requests or my yields, so maybe you can help me find it.
The complete code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping/tree/master/flight/flight_project
You can run it with the scrapy crawl airports command.
UPDATE 1:
I tried to answer the question on my own using yield from, without success, as you can see in the answer at the bottom... so if you have an idea?

Yes, I finally found the answer here on SO...
When you use a recursive yield, you need to use yield from. Here is a simplified example:
airport_list = ["airport1", "airport2", "airport3", "airport4"]

def parse_page_departure(airport, next_url, page_urls):
    print(airport, " / ", next_url)
    if not page_urls:
        return
    next_url = page_urls.pop()
    yield from parse_page_departure(airport, next_url, page_urls)

###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(next_airport, airport_list):
    ## GET EACH DEPARTURE PAGE
    departures_list = ["p1", "p2", "p3", "p4"]
    next_departure_url = departures_list.pop()
    yield parse_page_departure(next_airport, next_departure_url, departures_list)

    if not airport_list:
        print("no new airport")
        return
    next_airport_url = airport_list.pop()
    yield from parse_schedule(next_airport_url, airport_list)

next_airport_url = airport_list.pop()
result = parse_schedule(next_airport_url, airport_list)

for i in result:
    print(i)
    for d in i:
        print(d)
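Note that parse_schedule uses a plain yield at the parse_page_departure call site, so the outer generator produces generator objects rather than their values; that is why the driver loop above needs the inner for d in i iteration to actually run them.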
UPDATE, it doesn't work with the real program:
I tried to reproduce the same yield from pattern with the real program here, but using it on scrapy.Request raises an error, and I don't understand why...
Here is the Python traceback:
Traceback (most recent call last):
File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/reyman/.pyenv/versions/venv352/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/reyman/Projets/Flight-Scrapping/flight/flight_project/spiders/AirportsSpider.py", line 209, in parse_schedule
yield from scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
TypeError: 'Request' object is not iterable
2017-06-27 17:40:50 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-27 17:40:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
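The traceback is consistent with Python's semantics: yield from delegates to an iterable (typically another generator), while a scrapy.Request is a single, non-iterable object and has to be yielded directly. A minimal sketch of the distinction (spider name and URLs are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/page/1']

    def follow_pages(self, urls):
        # A generator: delegating to it with 'yield from' is valid.
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # A single Request is not iterable, so yield it directly ...
        yield scrapy.Request('https://example.com/page/2', callback=self.parse)
        # ... and reserve 'yield from' for delegating to another generator.
        yield from self.follow_pages(['https://example.com/page/3',
                                      'https://example.com/page/4'])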
Comment: ... not totally clear ... you call AirportData(response, 1) ... also a little typo here : self.pprint(schedule)
I used class AirportData to implement this (limited to 2 pages and 2 flights).
Updated my code: removed class AirportData and added class Page. All dependencies should now be satisfied.
It's not a typo: self.pprint(... is an AirportsSpider method used for pretty-printing objects, like the output shown at the end. I have enhanced class Schedule to show basic usage.
Comment: What is AirportData in your answer ?
EDIT: class AirportData has been removed. As noted at # ENDPOINT, a Page object's flight data is split into page.arrivals and page.departures. (Limited to 2 pages and 2 flights.)
Page = [Flight 1, Flight 2, ... Flight n]
schedule.airport['arrivals'] == [Page 1, Page 2, ..., Page n]
schedule.airport['departures'] == [Page 1, Page 2, ..., Page n]
Comment: ... we have multiples pages which contains multiples departures/arrivals.
Yes, at the time of the first answer I didn't have any api json response yet. Now I do get the api json response, but it doesn't reflect the current date; it returns data for the given timestamp. The api params look unusual, do you have a link describing them?
Nevertheless, consider the following simplified approach:
# Page object holding one page of Arrivals/Departures flight data
class Page(object):
    def __init__(self, title, schedule):
        # schedule holds either ['arrivals'] or ['departures']
        self.current = schedule['page']['current']
        self.total = schedule['page']['total']
        self.header = '{}:page:{} item:{}'.format(title, schedule['page'], schedule['item'])
        self.flight = []
        for data in schedule['data']:
            self.flight.append(data['flight'])

    def __iter__(self):
        yield from self.flight
# Schedule object holding all Pages of one Airport
class Schedule(object):
    def __init__(self):
        self.country = None
        self.airport = None

    def __str__(self):
        arrivals = self.airport['arrivals'][0]
        departures = self.airport['departures'][0]
        return '{}\n\t{}\n\t\t{}\n\t\t\t{}\n\t\t{}\n\t\t\t{}'. \
            format(self.country['name'],
                   self.airport['name'],
                   arrivals.header,
                   arrivals.flight[0]['airline']['name'],
                   departures.header,
                   departures.flight[0]['airline']['name'], )
# PARSE EACH AIRPORT OF COUNTRY
def parse_schedule(self, response):
    meta = response.meta
    if 'airport' in meta:
        # First call from parse_airports
        schedule = Schedule()
        schedule.country = response.meta['country']
        schedule.airport = response.meta['airport']
    else:
        schedule = response.meta['schedule']

    data = json.loads(response.body_as_unicode())
    airport = data['result']['response']['airport']
    schedule.airport['arrivals'].append(Page('Arrivals', airport['pluginData']['schedule']['arrivals']))
    schedule.airport['departures'].append(Page('Departures', airport['pluginData']['schedule']['departures']))

    page = schedule.airport['departures'][-1]
    if page.current < page.total:
        json_url = self.build_api_call(schedule.airport['code_little'], page.current + 1, self.compute_timestamp())
        yield scrapy.Request(json_url, meta={'schedule': schedule}, callback=self.parse_schedule)
    else:
        # ENDPOINT Schedule object holding one Airport.
        # schedule.airport['arrivals'] and schedule.airport['departures'] ==
        #   List of Pages with Lists of Flight data
        print(schedule)
# PARSE EACH AIRPORT
def parse_airports(self, response):
    country = response.meta['country']
    for airport in response.xpath('//a[@data-iata]'):
        name = ''.join(airport.xpath('./text()').extract()[0]).strip()
        if 'Charles' in name:
            meta = response.meta
            meta['airport'] = AirportItem()
            meta['airport']['name'] = name
            meta['airport']['link'] = airport.xpath('./@href').extract()[0]
            meta['airport']['lat'] = airport.xpath("./@data-lat").extract()[0]
            meta['airport']['lon'] = airport.xpath("./@data-lon").extract()[0]
            meta['airport']['code_little'] = airport.xpath('./@data-iata').extract()[0]
            meta['airport']['code_total'] = airport.xpath('./small/text()').extract()[0]
            json_url = self.build_api_call(meta['airport']['code_little'], 1, self.compute_timestamp())
            yield scrapy.Request(json_url, meta=meta, callback=self.parse_schedule)
# MAIN PARSE
Note: response.xpath('//a[@data-country]') returns all countries twice!
def parse(self, response):
    for a_country in response.xpath('//a[@data-country]'):
        name = a_country.xpath('./@title').extract()[0]
        if name == "France":
            country = CountryItem()
            country['name'] = name
            country['link'] = a_country.xpath('./@href').extract()[0]
            yield scrapy.Request(country['link'],
                                 meta={'country': country},
                                 callback=self.parse_airports)
Output, shortened to 2 pages and 2 flights per page:
France
Paris Charles de Gaulle Airport
Departures:(page=(1, 1, 7)) 2017-07-02 21:28:00 page:{'current': 1, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 696}
21:30 PM AF1558 Newcastle Airport (NCL) Air France ARJ Estimated dep 21:30
21:30 PM VY8833 Seville San Pablo Airport (SVQ) Vueling 320 Estimated dep 21:30
... (omitted for brevity)
Departures:(page=(2, 2, 7)) 2017-07-02 21:28:00 page:{'current': 2, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 696}
07:30 AM AF1680 London Heathrow Airport (LHR) Air France 789 Scheduled
07:30 AM SN3628 Brussels Airport (BRU) Brussels Airlines 733 Scheduled
... (omitted for brevity)
Arrivals:(page=(1, 1, 7)) 2017-07-02 21:28:00 page:{'current': 1, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 693}
16:30 PM LY325 Tel Aviv Ben Gurion International Airport (TLV) El Al Israel Airlines B739 Estimated 21:29
18:30 PM AY877 Helsinki Vantaa Airport (HEL) Finnair E190 Landed 21:21
... (omitted for brevity)
Arrivals:(page=(2, 2, 7)) 2017-07-02 21:28:00 page:{'current': 2, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 693}
00:15 AM AF982 Douala International Airport (DLA) Air France 772 Scheduled
23:15 PM AA44 New York John F. Kennedy International Airport (JFK) American Airlines B763 Scheduled
... (omitted for brevity)
Tested with Python 3.4.2 - Scrapy 1.4.0
I tried to clone the repository and investigate locally, but I hit some ConnectionRefused errors once it reached the departures parsing, so I'm not sure the answer I propose will fix that. Anyway:
###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(self, response):
    """we want to loop this continuously to build every departure and arrivals requests"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']

    urls_departures, urls_arrivals = self.compute_urls_by_page(response, item['airports'][i]['name'], item['airports'][i]['code_little'])
    if 'urls_departures' in response.meta:
        urls_departures += response.meta["urls_departures"]
    if 'urls_arrivals' in response.meta:
        urls_arrivals += response.meta["urls_arrivals"]
    print("urls_departures = ", len(urls_departures))
    print("urls_arrivals = ", len(urls_arrivals))

    item['airports'][i]['departures'] = []

    # now do next schedule items
    if not urls:
        yield scrapy.Request(urls_departures.pop(), self.parse_departures_page, meta={'airport_item': item, 'page_urls': urls_departures, 'i': i, 'p': 0}, dont_filter=True)
    else:
        url = urls.pop()
        yield scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1, 'urls_departures': urls_departures, 'urls_arrivals': urls_arrivals})

###################################
# PARSE EACH DEPARTURES / ARRIVALS
###################################
def parse_departures_page(self, response):
    item = response.meta['airport_item']
    p = response.meta['p']
    i = response.meta['i']
    page_urls = response.meta['page_urls']

    jsonload = json.loads(response.body_as_unicode())
    json_expression = jmespath.compile("result.response.airport.pluginData.schedule.departures.data")
    # Append a new page
    item['airports'][i]['departures'].append(json_expression.search(jsonload))

    if len(page_urls) > 0:
        page_url = page_urls.pop()
        yield scrapy.Request(page_url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': page_urls, 'i': i, 'p': p + 1}, dont_filter=True)
    else:
        yield item
But basically these were your mistakes:
- in both parse_schedule and parse_departures_page you had conditions guarding the yield of the final item;
- you were passing the wrong url to parse_departures_page;
- you need dont_filter=True on the requests for parse_departures_page;
- you were trying to keep too many loops alive to parse more information into the same object.
My proposed change keeps track of all the urls_departures of each airport, so you can iterate over them in parse_departures_page, which solves your problem.
Even if this solves your problem, I really recommend you change your data structure, so you can have multiple departures items and be able to extract this information much more efficiently.
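For illustration, here is a minimal sketch of such a flatter structure, with one item per flight instead of one deeply nested CountryItem (the item and field names are hypothetical, not part of the original project):

import scrapy

# Hypothetical flat item: one per flight, so parse_departures_page could
# yield items page by page instead of accumulating everything on a single
# CountryItem threaded through meta.
class FlightItem(scrapy.Item):
    country = scrapy.Field()       # e.g. 'Israel'
    airport_code = scrapy.Field()  # e.g. 'TLV'
    direction = scrapy.Field()     # 'departure' or 'arrival'
    page = scrapy.Field()          # page number the flight came from
    flight = scrapy.Field()        # raw flight dict from the JSON API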