Valid XPath queries generated by browser add-ons do not work against a page fetched with urllib2
I want to extract every instruction ID from this page:
import urllib2
import lxml.html as lh
url = 'https://secure.ssa.gov/apps10/reference.nsf/instructiontypecode!openview&restricttocategory=POMT'
response = urllib2.urlopen(url)
content = response.read()
root = lh.fromstring(content)
all_instruction_ids = root.xpath(XPATH_ALL_INSTRUCTION_IDS)
I have tried countless XPath expressions suggested by Chrome's developer tools, Firebug, and other browser plugins:
XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a/.'
#XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a/text()'
XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a[contains(normalize-space(), "")]'
XPATH_ALL_INSTRUCTION_IDS = '//*[@id="content"]/div/div/div[2]/table/tbody/tr/td[1]/font/a'
XPATH_ALL_INSTRUCTION_IDS = ".//*[@id='content']/div/div/div[2]/table/tbody/tr[2]/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "//form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "id('content')/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "/html/body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/font/a"
XPATH_ALL_INSTRUCTION_IDS = "//html//body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]//a"
XPATH_ALL_INSTRUCTION_IDS = "//html//body/form/div[1]/div[5]/div/div/div[2]/table/tbody/tr/td[1]/*/a"
However, none of them work when passed to the xpath() method of the element returned by lxml.html.fromstring().
The // XPath operator does not require you to start from the top of the document.
XPATH_ALL_INSTRUCTION_IDS = '//font/a'
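As a rough illustration of why the short query can succeed where the long browser-generated paths fail, here is a small sketch. The markup and the IDs in SAMPLE_HTML are invented for the example and are not taken from the real page. It assumes the likely cause is the tbody step: browsers add a tbody element to tables in their DOM even when the raw HTML has none, while lxml parses the source as served, so a path copied from a browser may point at an element that lxml never sees.

```python
import lxml.html as lh

# Hypothetical markup, made up for illustration; note there is no <tbody>.
SAMPLE_HTML = """
<html><body><form><div id="content">
<table>
<tr><td><font><a href="/apps10/reference.nsf/links/aaa">ID-001</a></font></td></tr>
<tr><td><font><a href="/apps10/reference.nsf/links/bbb">ID-002</a></font></td></tr>
</table>
</div></form></body></html>
"""

root = lh.fromstring(SAMPLE_HTML)

# Browser-style path: the tbody step matches nothing in the raw source.
with_tbody = root.xpath('//*[@id="content"]/table/tbody/tr/td[1]/font/a')

# Short descendant query: does not care which elements sit above font/a.
short_query = root.xpath('//font/a')
ids = [a.text_content().strip() for a in short_query]

print(with_tbody)
print(ids)
```

If the real page behaves the same way, dropping tbody from the copied paths (or using a short // query like the one above) should return the links.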
I suggest you take a look at an XPath cheatsheet.
I would find all links whose href contains reference.nsf/links:
//table//a[contains(@href, 'reference.nsf/links')]/text()
Works for me.
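A minimal sketch of how that attribute-based query might be used with lxml. SAMPLE_HTML, its link targets, and the IDs are hypothetical stand-ins for the fetched page; with the real page you would pass the content read from the response instead.

```python
import lxml.html as lh

# Hypothetical markup, made up for illustration.
SAMPLE_HTML = """
<html><body>
<a href="/help">Help</a>
<table>
<tr><td><a href="/apps10/reference.nsf/links/aaa">ID-001</a></td></tr>
<tr><td><a href="/apps10/reference.nsf/links/bbb">ID-002</a></td></tr>
<tr><td><a href="/apps10/other">Other</a></td></tr>
</table>
</body></html>
"""

root = lh.fromstring(SAMPLE_HTML)

# Keep only links inside a table whose href contains the given fragment,
# and return their text nodes directly.
texts = root.xpath("//table//a[contains(@href, 'reference.nsf/links')]/text()")
ids = [t.strip() for t in texts]

print(ids)
```

Selecting on the href rather than on the position of the link in the layout keeps the query independent of wrapper elements such as div, font, or tbody.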