Scrapy > IndexError: list index out of range
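I am trying to scrape some data from TripAdvisor.
I am interested in the "price range / cuisines / meals" details of restaurants.
So I use the following XPath to extract each of these 3 rows, which all share the same class: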
response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').extract()[1]
I tested it directly in the scrapy shell and it works fine:
scrapy shell https://www.tripadvisor.com/Restaurant_Review-g187514-d15364769-Reviews-La_Gaditana_Castellana-Madrid.html
But when I integrate it into my script, I get the following error:
Traceback (most recent call last):
File "/usr/lib64/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/Scrapy_TripAdvisor_Restaurant-master/tripadvisor_las_vegas/tripadvisor_las_vegas/spiders/res_las_vegas.py", line 64, in parse_listing
(response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
File "/usr/lib/python3.6/site-packages/parsel/selector.py", line 61, in __getitem__
o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
Here is the relevant part of my code; I explain it below:
# extract restaurant cuisine
row_cuisine_overviewcard = \
    (response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')[1])
row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
if (row_cuisine_overviewcard == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
elif (row_cuisine_card == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
else:
    cuisine = None
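TripAdvisor restaurants have 2 different types of pages, with 2 different formats.
The first uses the overviewcard classes, the second uses the card classes.
So I want to check whether the first one (overviewcard) exists; if it does not, try the second one (card); and if that is missing too, set the value to None.
:D But it looks like Python executes both... and since the second one does not exist on the page, the script stops.
Could it be an indentation error?
Thanks for your help.
Regards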
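Your second selector (row_cuisine_card) fails because the element does not exist on the page. When you then try to access [1] in the result, it raises an error because the result list is empty.
Assuming you really want item 1, try this: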
row_cuisine_overviewcard = \
    (response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')[1])
# Here we get all the values, even if the result is empty.
row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall())
if (row_cuisine_overviewcard == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
# Here we first check that the result has more than 1 item, and then we check the value.
elif (len(row_cuisine_card) > 1 and row_cuisine_card[1] == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
else:
    cuisine = None
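Whenever you try to get a specific index from a selector, you should apply the same kind of safety check. In other words, make sure you actually have a value before accessing it.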
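One way to avoid repeating that check at every call site is a small helper. The following is only a sketch: `nth` is a hypothetical name, not part of Scrapy or parsel, and the two lists are stand-ins for what `response.xpath(...).getall()` might return on the two page types.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def nth(items: Sequence[T], index: int, default: Optional[T] = None) -> Optional[T]:
    """Return items[index], or default when the sequence is too short."""
    try:
        return items[index]
    except IndexError:
        return default


# Stand-ins (assumed data) for the result of response.xpath(...).getall().
titles_on_overview_page = ["PRICE RANGE", "CUISINES", "MEALS"]
titles_on_other_page = []  # the class is absent, so nothing matched

print(nth(titles_on_overview_page, 1))  # prints: CUISINES
print(nth(titles_on_other_page, 1))     # prints: None
```

Because a missing row comes back as None instead of raising, a comparison such as `nth(titles, 1) == "CUISINES"` is simply False on pages that use the other layout.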
Your problem already occurs when you perform the check in this line:
row_cuisine_card = \
(response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
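You are trying to extract a value from the website that may not exist. That is, if
response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')
returns no element or only one, you cannot access the second element of the returned list (which is what the appended [1] tries to do).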
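I suggest first storing the values you extract from the website in local variables, so that you can then check whether the value you want was actually found. My guess is that the page it breaks on does not contain the information you want.
This would look roughly like the following code: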
# extract restaurant cuisine
cuisine = None
# Default to None so the comparisons below are safe when a layout is missing.
row_cuisine_overviewcard = None
row_cuisine_card = None
# .getall() returns plain strings, so the comparison with "CUISINES" works.
cuisine_overviewcard_sections = response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall()
if len(cuisine_overviewcard_sections) >= 2:
    row_cuisine_overviewcard = cuisine_overviewcard_sections[1]
cuisine_card_sections = response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall()
if len(cuisine_card_sections) >= 2:
    row_cuisine_card = cuisine_card_sections[1]
if (row_cuisine_overviewcard == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
elif (row_cuisine_card == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
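Since you only need one piece of information, the code can be tidied up so that the second lookup is skipped when the first XPath check already returns the right answer: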
# extract restaurant cuisine
cuisine = None
# .getall() returns plain strings, so the comparison with "CUISINES" works.
cuisine_overviewcard_sections = response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall()
if len(cuisine_overviewcard_sections) >= 2 and cuisine_overviewcard_sections[1] == "CUISINES":
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
else:
    # Only query the second layout when the first one did not match.
    cuisine_card_sections = response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall()
    if len(cuisine_card_sections) >= 2 and cuisine_card_sections[1] == "CUISINES":
        cuisine = \
            response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
This way you only run the (potentially expensive) XPath search when you actually need it.
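The same "query only when needed" idea can be written as a loop over the page layouts. This is a minimal sketch under stated assumptions: `extract_cuisine`, the `select` callable and the shortened XPath strings are all hypothetical, and `fake_select` stands in for a real response so the example runs without Scrapy.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def extract_cuisine(
    select: Callable[[str], List[str]],
    layouts: Sequence[Tuple[str, str]],
    label: str = "CUISINES",
    row: int = 1,
) -> Optional[str]:
    """Try each (title_xpath, value_xpath) layout in order, lazily.

    The value XPath of a layout is only queried after its title row matched.
    """
    for title_xpath, value_xpath in layouts:
        titles = select(title_xpath)
        if len(titles) > row and titles[row].strip() == label:
            values = select(value_xpath)
            if len(values) > row:
                return values[row]
    return None


# Fake page (assumed data) that only contains the second layout.
fake_page = {
    "//title-b": ["PRICE RANGE", "CUISINES"],
    "//value-b": ["$$", "Spanish, Mediterranean"],
}
calls = []


def fake_select(xpath: str) -> List[str]:
    calls.append(xpath)  # record which queries were actually executed
    return fake_page.get(xpath, [])


layouts = [("//title-a", "//value-a"), ("//title-b", "//value-b")]
cuisine = extract_cuisine(fake_select, layouts)
print(cuisine)  # prints: Spanish, Mediterranean
print(calls)    # prints: ['//title-a', '//title-b', '//value-b']
```

In a spider callback, `select` could be something like `lambda xp: response.xpath(xp).getall()`, with the two long class-based XPath pairs from the question as `layouts`. Note that "//value-a" is never queried, because the title check for the first layout already failed.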