tree.xpath returning empty list

I'm trying to write a program that can scrape a given website. So far I have this:

from lxml import html
import requests

page = requests.get('https://www.cruiseplum.com/search#{"numPax":2,"geo":"US","portsMatchAll":true,"numOptionsShown":20,"ppdIncludesTaxTips":true,"uiVersion":"split","sortTableByField":"dd","sortTableOrderDesc":false,"filter":null}')

tree = html.fromstring(page.content)

date = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[1]/text()')

ship = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[2]/text()')

length = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[4]/text()')

meta = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[6]/text()')

price = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[7]/text()')

print('Date: ', date)
print('Ship: ', ship)
print('Length: ', length)
print('Meta: ', meta)
print('Price: ', price)

When this runs, the lists are returned empty.

I'm fairly new to Python and coding in general, so any help you can provide would be greatly appreciated!

Thanks

The problem seems to be the URL you are navigating to. Opening it in a browser produces a prompt asking whether you want to restore a bookmarked search.

I don't see a simple way around this. Clicking 'Yes' triggers a JavaScript action rather than an actual redirect to a URL with different parameters.

I'd suggest using something like Selenium for this.

First, the link you are using is incorrect; this is the correct link (it is the URL the site loads the data into and returns after you click the 'Yes' button):

https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}
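The only difference from the original link is that the double quotes in the JSON fragment are percent-encoded as `%22`. A minimal sketch of how that encoded fragment can be built with the standard library (the parameter dict simply mirrors the one embedded in the URL):

```python
import json
import urllib.parse

# the search parameters embedded in the URL fragment
params = {"numPax": 2, "geo": "US", "portsMatchAll": True,
          "numOptionsShown": 20, "ppdIncludesTaxTips": True,
          "uiVersion": "split", "sortTableByField": "dd",
          "sortTableOrderDesc": False, "filter": None}

# compact JSON, then percent-encode the quotes while leaving {}:, readable
fragment = urllib.parse.quote(json.dumps(params, separators=(',', ':')),
                              safe='{}:,')
url = 'https://www.cruiseplum.com/search#' + fragment
print(url)
```

Building the fragment this way makes it easy to change parameters (e.g. `numPax`) without hand-editing the encoded string.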

Second, when you fetch the response object with requests, the data inside the table is hidden and not returned:

from lxml import html
import requests

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
r = requests.get(u)
t = html.fromstring(r.content)

for i in t.xpath('//tr//text()'):
    print(i)

This returns:

Recent update: new computer-optimized interface and new filters
Want to track your favorite cruises?
Login or sign up to get started.
Login / Sign Up
Loading...
Email status
Unverified
My favorites & alerts
Log out
Want to track your favorite cruises?
Login or sign up to get started.
Login / Sign Up
Loading...
Email status
Unverified
My favorites & alerts
Log out
Date Colors:
(vs. selected)
Lowest Price
Lower Price
Same Price
Higher Price

Even if you use requests_html, the content is still hidden:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get(u)

You will need Selenium to access the hidden HTML content:

from lxml import html
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
driver = webdriver.Chrome()  # Selenium 4+ locates chromedriver itself; pass a Service for a custom path
driver.get(u)

time.sleep(2)

# click 'Yes' on the restore-bookmarked-search prompt
driver.find_element(By.ID, 'restoreSettingsYesEncl').click()
time.sleep(10)  # wait until the website downloads the data; without this we can't move on

elem = driver.find_element(By.XPATH, "//*")
source_code = elem.get_attribute("innerHTML")

t = html.fromstring(source_code)

for i in t.xpath('//td[@class="dc-table-column _1"]/text()'):
    print(i.strip())

driver.quit()

This returns the first column (the ship names):

Costa Luminosa
Navigator Of The Seas
Navigator Of The Seas
Carnival Ecstasy
Carnival Ecstasy
Carnival Ecstasy
Carnival Victory
Carnival Victory
Carnival Victory
Costa Favolosa
Costa Favolosa
Costa Favolosa
Costa Smeralda
Carnival Inspiration
Carnival Inspiration
Carnival Inspiration
Costa Smeralda
Costa Smeralda
Disney Dream
Disney Dream

As you can see, the content of the table is now accessible through Selenium's get_attribute("innerHTML").
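The same per-column XPath approach can be extended to pull several columns and zip them into rows. A sketch against a toy stand-in for the page source (the HTML string here is invented for illustration; only the `dc-table-column _1` class name comes from the snippet above, and `_2` is an assumed sibling class):

```python
from lxml import html

# toy stand-in for the innerHTML pulled via Selenium; the real markup will differ
source_code = """
<table>
  <tr><td class="dc-table-column _1">Costa Luminosa</td>
      <td class="dc-table-column _2">20 nights</td></tr>
  <tr><td class="dc-table-column _1">Disney Dream</td>
      <td class="dc-table-column _2">4 nights</td></tr>
</table>
"""
t = html.fromstring(source_code)

# extract each column separately, then zip the columns into per-cruise rows
ships = [s.strip() for s in t.xpath('//td[@class="dc-table-column _1"]/text()')]
lengths = [s.strip() for s in t.xpath('//td[@class="dc-table-column _2"]/text()')]
rows = list(zip(ships, lengths))
print(rows)
```

Zipping parallel column lists works as long as every row has a value in every column; iterating over `//tr` elements instead is more robust when cells can be empty.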

The next step is to scrape the rows (ship, route, days, region, ...) and store them in a CSV file (or any other format), then do the same for all 4,051 pages.
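The storage step can be sketched with the standard csv module (the rows below are invented placeholders standing in for the values the scraper would collect):

```python
import csv

# hypothetical scraped rows: (ship, days, region, price) -- placeholder values
rows = [
    ("Costa Luminosa", "20", "Transatlantic", "$1,139"),
    ("Disney Dream", "4", "Bahamas", "$1,425"),
]

with open("cruises.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["ship", "days", "region", "price"])  # header row
    writer.writerows(rows)
```

Opening the file once and appending each page's rows inside the pagination loop avoids holding all 4,051 pages in memory.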