Scraping an Element using lxml and XPath
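The issue I'm having is scraping the element itself. I'm able to scrape the first two (IncidentNbr and DispatchTime), but I can't get the address (1300 Dunn Ave). I want to scrape that element in a way that is dynamic enough that I'm not parsing the literal string "1300 Dunn Ave"; I'm parsing the element that holds it. Here is the source code: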
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 10:25</span></nobr></td>
<td>
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>
Here is my code:
from lxml import html
import requests
page = requests.get('http://callsforservice.jaxsheriff.org/')
tree = html.fromstring(page.text)
callSignal = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblIncidentNbr"]/text()')
dispatchTime = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblDispatchTime"]/text()')
location = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')
print('Call Signal: ', callSignal)
print("Dispatch Time: ", dispatchTime)
print("Location: ", location)
Here is my output:
Call Signal: ['150318182198']
Dispatch Time: ['3-18 10:25']
Location: []
Any idea how to scrape the address?
This is the element you are looking for:
<a id="lstCallsForService_ctrl0_lnkAddress"
href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL"
target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
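As you can see, it is not a span element. Your current XPath expression:
//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
looks for a span element with this ID, when it should select an a element. Use
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
instead. The result should then be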
Location: ['1300 DUNN AVE']
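To see the difference without hitting the live site, here is a minimal sketch that parses the markup quoted in the question from a string. The wrapping `<html>`/`<table>` tags and the variable names are assumptions added for illustration; the `*` wildcard variant is an extra option that matches any element carrying the ID, regardless of tag name.

```python
from lxml import html

# Markup copied from the question, wrapped in a table so it parses as a document.
SAMPLE = """
<html><body><table><tr>
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 10:25</span></nobr></td>
<td>
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>
</tr></table></body></html>
"""

tree = html.fromstring(SAMPLE)

# The original expression: no span has this id, so nothing matches.
span_result = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')

# The corrected expression: the id belongs to an a element.
a_result = tree.xpath('//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')

# Tag-agnostic variant: match whatever element has this id.
any_result = tree.xpath('//*[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')

print(span_result)  # []
print(a_result)     # ['1300 DUNN AVE']
print(any_result)   # ['1300 DUNN AVE']
```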
Also read alecxe's answer, whose suggestions are more practical than mine.
First of all, it is an a element, not a span. You also need a double slash before text():
//a[@id="lstCallsForService_ctrl0_lnkAddress"]//text()
Why the double slash? Because on the actual page this a element has no direct text node children:
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
<u>5100 CLEVELAND RD</u>
</a>
You can also reach the text through the u tag:
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/u/text()
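The three variants can be compared side by side on the nested markup. This is a small sketch under stated assumptions: the snippet is written without whitespace between the tags so the results are exact, the `href` attribute is omitted for brevity, and the variable names are illustrative.

```python
from lxml import html

# The a element wraps its text in a u element, as on the live page.
SAMPLE = (
    '<html><body>'
    '<a id="lstCallsForService_ctrl0_lnkAddress" target="_blank">'
    '<u>5100 CLEVELAND RD</u></a>'
    '</body></html>'
)

tree = html.fromstring(SAMPLE)
link = '//a[@id="lstCallsForService_ctrl0_lnkAddress"]'

# /text() selects only text nodes that are direct children of a: there are none.
direct = tree.xpath(link + '/text()')

# //text() selects text nodes at any depth below a, including the one inside u.
descendant = tree.xpath(link + '//text()')

# /u/text() steps into the u child explicitly.
via_u = tree.xpath(link + '/u/text()')

# text_content() concatenates all descendant text of the element.
content = tree.xpath(link)[0].text_content()

print(direct)      # []
print(descendant)  # ['5100 CLEVELAND RD']
print(via_u)       # ['5100 CLEVELAND RD']
print(content)     # 5100 CLEVELAND RD
```

With line breaks between the tags, as in the page source, `/text()` and `//text()` would also return whitespace-only strings, so stripping the results is usually worthwhile.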
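Also, to extend the solution to multiple results:
- iterate over the table rows
- for every row, find the cell values by matching part of the id attribute with contains()
- use the text_content() method to get the text
Implementation: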
# Each call is one table row with class "closedCall"
for item in tree.xpath('//tr[@class="closedCall"]'):
    # Match on part of the id; the leading "." keeps the search inside this row
    callSignal = item.xpath('.//span[contains(@id, "lblIncidentNbr")]')[0].text_content()
    dispatchTime = item.xpath('.//span[contains(@id, "lblDispatchTime")]')[0].text_content()
    location = item.xpath('.//a[contains(@id, "lnkAddress")]')[0].text_content()
    print('Call Signal: ', callSignal)
    print("Dispatch Time: ", dispatchTime)
    print("Location: ", location)
    print("------")
Prints:
Call Signal: 150318182333
Dispatch Time: 3-18 11:22
Location: 9600 APPLECROSS RD
------
Call Signal: 150318182263
Dispatch Time: 3-18 11:12
Location: 1100 E 1ST ST
------
...
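The row-by-row approach can also be tried offline. The sketch below is one possible variation, not the answer's exact code: the sample rows, the `ctrl1` id in the second row, and the helper names `first_text` and `parse_calls` are all assumptions made for illustration. The second row deliberately has no address link, to show why indexing `[0]` blindly would raise an IndexError.

```python
from lxml import html

# Two hypothetical rows; the second one has no address link.
SAMPLE = """
<html><body><table>
<tr class="closedCall">
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182333</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 11:22</span></nobr></td>
<td><a id="lstCallsForService_ctrl0_lnkAddress"><u>9600 APPLECROSS RD</u></a></td>
</tr>
<tr class="closedCall">
<td><span id="lstCallsForService_ctrl1_lblIncidentNbr">150318182263</span></td>
<td><nobr><span id="lstCallsForService_ctrl1_lblDispatchTime">3-18 11:12</span></nobr></td>
<td></td>
</tr>
</table></body></html>
"""


def first_text(row, xpath_expr):
    """Return the stripped text of the first match in this row, or None."""
    matches = row.xpath(xpath_expr)
    return matches[0].text_content().strip() if matches else None


def parse_calls(tree):
    """Collect one dict per table row with class "closedCall"."""
    calls = []
    for row in tree.xpath('//tr[@class="closedCall"]'):
        calls.append({
            # contains() matches on the id suffix shared by every row
            'call_signal': first_text(row, './/span[contains(@id, "lblIncidentNbr")]'),
            'dispatch_time': first_text(row, './/span[contains(@id, "lblDispatchTime")]'),
            'location': first_text(row, './/a[contains(@id, "lnkAddress")]'),
        })
    return calls


calls = parse_calls(html.fromstring(SAMPLE))
for call in calls:
    print(call)
```

Returning dicts instead of printing inside the loop keeps the parsing separate from the output, so the same function could feed a CSV writer or a database insert.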