Legitimate XPath queries generated by browser add-ons not working against urllib2-fetched page

Question

I want to extract every instruction ID from this page:

import urllib2  # Python 2 standard library
import lxml.html as lh

url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib2.urlopen(url)
content = response.read()
root = lh.fromstring(content)
# XPATH_ALL_INSTRUCTION_IDS is one of the expressions listed below
all_instruction_ids = root.xpath(XPATH_ALL_INSTRUCTION_IDS)

I have tried countless XPath expressions provided by Chrome's developer tools, Firebug, and other browser add-ons:

XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a/.'
#XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a/text()'
XPATH_ALL_INSTRUCTION_IDS  = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a[contains(normalize-space(), "")]'
XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a'
XPATH_ALL_INSTRUCTION_IDS = ".//*[@id='content']/div/div/div[2]/table/tbody/tr[2]/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS  = "//form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS  = "id('content')/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS  = "/html/body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "//html//body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]//a"
XPATH_ALL_INSTRUCTION_IDS = "//html//body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/*/a"

However, none of them return anything when passed to the xpath() method of the element returned by lxml.html.fromstring().

Answer 1

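The // XPath operator does not require you to start from the top of the document, so you do not need the full path that the browser tools generate.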

XPATH_ALL_INSTRUCTION_IDS = '//font/a'

I suggest you take a look at an XPath cheatsheet.
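
One possible reason the long expressions fail, offered here as an assumption rather than something confirmed in this thread: browsers insert a <tbody> element into every table when they build the DOM, and developer tools copy paths from that DOM. lxml parses the markup as the server sent it and does not add <tbody>. The sketch below uses made-up sample HTML (the IDs and hrefs are hypothetical, not taken from the real page) to show a browser-style path matching nothing while the short path works:

```python
import lxml.html as lh

# Hypothetical markup standing in for the fetched page: the raw source
# has no <tbody>, which is what lxml sees (a browser would add one).
SAMPLE_HTML = """
<html><body><div id="content">
<table>
  <tr><td><font><a href="/apps10/reference.nsf/links/0001">ID 001</a></font></td><td>Title A</td></tr>
  <tr><td><font><a href="/apps10/reference.nsf/links/0002">ID 002</a></font></td><td>Title B</td></tr>
</table>
</div></body></html>
"""

root = lh.fromstring(SAMPLE_HTML)

# Path in the style copied from browser tools: it requires a tbody step.
browser_style = root.xpath('//*[@id="content"]/table/tbody/tr/td[1]/font/a')

# Short path from the answer: no tbody step, so it matches the raw markup.
short_form = root.xpath('//font/a')

ids = [a.text_content().strip() for a in short_form]
print(len(browser_style), ids)
```

If the real page behaves the same way, dropping the tbody step from any of the generated expressions may be enough to make them match.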

Answer 2

I would find all links whose href contains reference.nsf/links:

//table//a[contains(@href, 'reference.nsf/links')]/text()

Works for me.
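
A minimal sketch of how that expression behaves, run against made-up sample HTML rather than the real page (the hrefs, IDs, and the extra "Help" link are hypothetical). Filtering on the href keeps only the instruction links and skips other anchors that happen to sit in the same table:

```python
import lxml.html as lh

# Hypothetical markup: two instruction links plus one unrelated link.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td><font><a href="/apps10/reference.nsf/links/0001">ID 001</a></font></td></tr>
  <tr><td><font><a href="/apps10/reference.nsf/links/0002">ID 002</a></font></td></tr>
  <tr><td><a href="/help.html">Help</a></td></tr>
</table>
</body></html>
"""

root = lh.fromstring(SAMPLE_HTML)

# text() yields the link text directly as strings, not elements.
raw_ids = root.xpath("//table//a[contains(@href, 'reference.nsf/links')]/text()")
instruction_ids = [s.strip() for s in raw_ids]
print(instruction_ids)
```

Because the expression uses //table//a, it does not depend on whether a tbody element sits between the table and its rows.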


html

python

xpath

lxml

urllib2