Scraping an Element using lxml and XPath

Question

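The issue I am having is scraping out the element itself. I am able to scrape the first two (IncidentNbr and DispatchTime), but I can't get the address (1300 Dunn Ave). I want to be able to scrape that element, but keep it dynamic enough that I am not parsing the literal string "1300 Dunn Ave"; I am parsing whatever that element contains. Here is the source code: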

<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 10:25</span></nobr></td>
<td>
    <a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>

Here is my code:

from lxml import html
import requests

page = requests.get('http://callsforservice.jaxsheriff.org/')
tree = html.fromstring(page.text)

callSignal = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblIncidentNbr"]/text()')
dispatchTime = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblDispatchTime"]/text()')
location = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')



print('Call Signal: ', callSignal)
print("Dispatch Time: ", dispatchTime)
print("Location: ", location)

Here is my output:

Call Signal:  ['150318182198']
Dispatch Time:  ['3-18 10:25']
Location:  []

Any idea how to scrape out the address?

Answer 1

This is the element you are looking for:

<a id="lstCallsForService_ctrl0_lnkAddress"
   href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL"
   target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>

As you can see, it is not a span element. Your current XPath expression:

//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()

looks for a span element with this ID, whereas it should select an a element. Use

//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()

instead. The result should then be

Location:  ['1300 DUNN AVE']
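The difference between the two expressions can be checked offline. The following is a minimal sketch that parses the markup quoted in the question from a string instead of fetching the live page; the surrounding html/body/table/tr wrapper and the names SNIPPET, wrong and right are assumptions added for illustration.

```python
from lxml import html

# Markup copied from the question; the html/body/table/tr wrapper is an
# assumption added so the fragment parses as a complete document.
SNIPPET = """
<html><body><table><tr>
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 10:25</span></nobr></td>
<td>
    <a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>
</tr></table></body></html>
"""

tree = html.fromstring(SNIPPET)

# The id exists, but on an <a> element, so the span query matches nothing.
wrong = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')

# Same id, correct tag name.
right = tree.xpath('//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')

print("span query:", wrong)
print("a query:   ", right)
```

An XPath query that matches no nodes returns an empty list rather than raising an error, which is why the original script printed `Location:  []` without any other sign of a problem.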

Please also read alecxe's answer, whose suggestions are more practical than mine.

Answer 2

First of all, it is an a element, not a span. You also need a double slash before text():

//a[@id="lstCallsForService_ctrl0_lnkAddress"]//text()

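Why the double slash? Because on the actual page the address is not a direct text node child of the a element; it sits inside a nested u element: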

<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
    <u>5100 CLEVELAND RD</u>
</a>

You could also reach the text through the u tag:

//a[@id="lstCallsForService_ctrl0_lnkAddress"]/u/text()
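To make the single-slash versus double-slash difference concrete, here is a small sketch run against the nested markup shown above. The html/body wrapper and the helper non_blank are assumptions for illustration; the helper drops the whitespace-only text nodes that come from the indentation inside the a element.

```python
from lxml import html

# The <a> element as it appears on the live page, wrapped in html/body
# (the wrapper is an assumption so the string parses as a full document).
SNIPPET = """
<html><body>
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
    <u>5100 CLEVELAND RD</u>
</a>
</body></html>
"""

tree = html.fromstring(SNIPPET)
base = '//a[@id="lstCallsForService_ctrl0_lnkAddress"]'


def non_blank(nodes):
    """Strip each text node and drop the ones that are only whitespace."""
    return [n.strip() for n in nodes if n.strip()]


# /text() looks only at text nodes that are direct children of <a>.
direct = non_blank(tree.xpath(base + '/text()'))

# //text() also descends into nested elements such as <u>.
descendant = non_blank(tree.xpath(base + '//text()'))

# Explicit path through the <u> child.
via_u = tree.xpath(base + '/u/text()')

print("direct:    ", direct)
print("descendant:", descendant)
print("via u:     ", via_u)
```

The //text() form is the more forgiving of the two working options, since it keeps returning the address even if the site changes the wrapper element from u to something else.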

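Also, to scale the solution to multiple results:

- iterate over the table rows
- for each row, find the cells using contains() on the id attribute value
- use the text_content() method to get the text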

Implementation:

# Each call is one table row; the leading "." keeps the inner queries inside that row.
for item in tree.xpath('//tr[@class="closedCall"]'):
    # contains() matches the stable part of the id, whatever the ctrlN prefix is.
    callSignal = item.xpath('.//span[contains(@id, "lblIncidentNbr")]')[0].text_content()
    dispatchTime = item.xpath('.//span[contains(@id, "lblDispatchTime")]')[0].text_content()
    location = item.xpath('.//a[contains(@id, "lnkAddress")]')[0].text_content()

    print('Call Signal: ', callSignal)
    print("Dispatch Time: ", dispatchTime)
    print("Location: ", location)
    print("------")

Prints:

Call Signal:  150318182333
Dispatch Time:  3-18 11:22
Location:  9600 APPLECROSS RD
------
Call Signal:  150318182263
Dispatch Time:  3-18 11:12
Location:  1100 E 1ST ST
------
...
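The row-by-row approach can also be tried without network access. The sketch below assumes a table shaped like the one on the page: the closedCall class and the id suffixes come from the answer, while the sample rows, the placeholder href values, and the helpers first_text and parse_calls are hypothetical. Unlike the loop above, it returns None for a missing cell instead of raising IndexError.

```python
from lxml import html

# Hypothetical sample table built from the values in the printed output above.
SAMPLE = """
<html><body><table>
<tr class="closedCall">
  <td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182333</span></td>
  <td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 11:22</span></nobr></td>
  <td><a id="lstCallsForService_ctrl0_lnkAddress" href="#"><u>9600 APPLECROSS RD</u></a></td>
</tr>
<tr class="closedCall">
  <td><span id="lstCallsForService_ctrl1_lblIncidentNbr">150318182263</span></td>
  <td><nobr><span id="lstCallsForService_ctrl1_lblDispatchTime">3-18 11:12</span></nobr></td>
  <td><a id="lstCallsForService_ctrl1_lnkAddress" href="#"><u>1100 E 1ST ST</u></a></td>
</tr>
</table></body></html>
"""


def first_text(row, expr):
    """Return the stripped text of the first match of expr in row, or None."""
    matches = row.xpath(expr)
    return matches[0].text_content().strip() if matches else None


def parse_calls(tree):
    """Collect one dict per table row that has the closedCall class."""
    calls = []
    for row in tree.xpath('//tr[@class="closedCall"]'):
        calls.append({
            "call_signal": first_text(row, './/span[contains(@id, "lblIncidentNbr")]'),
            "dispatch_time": first_text(row, './/span[contains(@id, "lblDispatchTime")]'),
            "location": first_text(row, './/a[contains(@id, "lnkAddress")]'),
        })
    return calls


calls = parse_calls(html.fromstring(SAMPLE))
for call in calls:
    print(call)
```

Returning a list of dicts rather than printing inside the loop is a design choice made here so the results can be filtered or written to a file later; it is not part of the original answer.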

