lxml xpath expression help needed

I have the following HTML, taken from a web page's view-source:

<a target="_blank" rel="nofollow" href="http://www.facebook.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#facebook"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#linkedin"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.youtube.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#youtube"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.twitter.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#twitter"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.014media.com?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#website"></use></svg></div>
</a>

Using the XPath expression below, I am trying to extract the LinkedIn URL, but I can't get it to work.

from lxml import html, etree

asd = """<a target="_blank" rel="nofollow" href="http://www.facebook.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#facebook"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#linkedin"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.youtube.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#youtube"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.twitter.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#twitter"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.014media.com?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#website"></use></svg></div>
</a>"""

html.fromstring(asd.replace("xlink:href","xlinkhref")).xpath('(//a//div//svg//use[contains(@xlinkhref,"linkedin")])//@href')

The output is

[]

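Because of an lxml.etree.XPathEvalError: Undefined namespace prefix error, I had to replace the ":" in the attribute name, but I still can't figure out what I'm doing wrong. Any suggestions would be much appreciated.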

Using re I can extract what I need, but I still can't find a solution with lxml:

import re

[each.split('"')[0] for each in re.findall('<a target="_blank" rel="nofollow" href="(.+?)</a>',asd,re.DOTALL) if '/sprite.svg#linkedin' in each][0].split('?')[0]

I've never really used lxml's html module, only etree. It (html) seems to treat namespaces slightly differently from etree.

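In your sample data, the namespace prefix xlink is not bound to a namespace URI. Even if I add a declaration to bind it (xmlns:xlink="http://www.w3.org/1999/xlink"), html doesn't seem to work the same way as etree does (where you pass a "namespaces" dict argument to xpath()).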

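Another example is the use element. It's in the default namespace https://www.w3.org/2000/svg, but if I add namespaces={"svg": "https://www.w3.org/2000/svg"} and use the prefix in the XPath (svg:use), it doesn't select anything. It only works when I write use without a prefix.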
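One way to see the difference described above is to print the tag and attribute names each parser produces. This is only a sketch: the `snippet` string is a hypothetical, trimmed-down fragment (with the xlink prefix declared so that the XML parser accepts it), not the real page.

```python
from lxml import etree, html

# Hypothetical fragment modelled on the sample data; the xlink declaration
# is added here only so that etree's XML parser accepts it.
snippet = (
    '<svg xmlns="https://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<use xlink:href="/sprite.svg#linkedin"></use></svg>'
)

# Parsed as HTML: names are kept as plain strings, no namespace handling.
html_use = next(html.fromstring(snippet).iter("use"))

# Parsed as XML: names are expanded to {namespace-uri}local-name form.
xml_use = next(etree.fromstring(snippet).iter("{*}use"))

print(html_use.tag, dict(html_use.attrib))
print(xml_use.tag, dict(xml_use.attrib))
```

If the html tree reports a plain use tag and an attribute literally named xlink:href, that would explain why a namespaces mapping has nothing to match against there.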

If your actual data is namespace well-formed, including a binding for the xlink prefix, you can use etree and map the prefixes.

If not, you'll have to stick with html and use some local-name() tricks. (Another oddity is that html includes the prefix in the local name, so you have to match xlink:href and not just href.)
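The local-name() oddity mentioned above can be checked in isolation. The sketch below assumes a hypothetical single-link `fragment` modelled on the sample data and compares the two attribute tests on the use element.

```python
from lxml import html

# Hypothetical single-link fragment modelled on the sample data.
fragment = (
    '<a href="http://www.linkedin.com/company/014-media">'
    '<svg xmlns="https://www.w3.org/2000/svg">'
    '<use xlink:href="/sprite.svg#linkedin"></use></svg></a>'
)
doc = html.fromstring(fragment)

# Matching on the bare local name: the use element has no attribute
# whose local name is just "href".
bare = doc.xpath('//use/@*[local-name()="href"]')

# Matching on the name including the prefix, as the html parser stores it.
prefixed = doc.xpath('//use/@*[local-name()="xlink:href"]')

print(bare, prefixed)
```

Filtering on `@*` with a string comparison sidesteps the prefix lookup entirely, which is why it avoids the Undefined namespace prefix error.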

Here's an example of both...

from lxml import html, etree

# --------------------- TEST USING html --------------------------------------------------------------------------------

# The xlink namespace prefix is not bound to a namespace uri so this is not namespace well-formed.
asd = """<a target="_blank" rel="nofollow" href="http://www.facebook.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#facebook"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#linkedin"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.youtube.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#youtube"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.twitter.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#twitter"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.014media.com?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#website"></use></svg></div>
</a>"""

href = html.fromstring(asd).xpath('//a[.//use/@*[local-name()="xlink:href"][contains(.,"linkedin")]]/@href')[0]
print(f"Results using html:  {href}")

# --------------------- TEST USING etree -------------------------------------------------------------------------------

# Modified to include binding of xlink namespace prefix to a namespace uri to make it well formed.
asd2 = """<html xmlns:xlink="http://www.w3.org/1999/xlink">
<a target="_blank" rel="nofollow" href="http://www.facebook.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#facebook"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#linkedin"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.youtube.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#youtube"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.twitter.com/014media?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#twitter"></use></svg></div>
</a><a target="_blank" rel="nofollow" href="http://www.014media.com?utm_source=Thalamus.co&amp;utm_medium=AdVendorPage&amp;utm_content=https://www.thalamus.co/buyers/014-media"><div class="icon--rounded icon"><svg xmlns="https://www.w3.org/2000/svg"><use xlink:href="/sprite.svg#website"></use></svg></div>
</a>
</html>"""

namespaces = {"svg": "https://www.w3.org/2000/svg", "xlink": "http://www.w3.org/1999/xlink"}
href2 = etree.fromstring(asd2).xpath('//a[.//svg:use[contains(@xlink:href,"linkedin")]]/@href', namespaces=namespaces)[0]
print(f"Results using etree: {href2}")

This outputs the following...

Results using html:  http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&utm_medium=AdVendorPage&utm_content=https://www.thalamus.co/buyers/014-media
Results using etree: http://www.linkedin.com/company/014-media?utm_source=Thalamus.co&utm_medium=AdVendorPage&utm_content=https://www.thalamus.co/buyers/014-media
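As a side note on why the attempt in the question returned [] even after the colon was replaced: that expression selects the use element and then looks for @href on it and below it, but the href lives on the enclosing a. The following is a minimal sketch using a hypothetical two-link `sample` (shortened from the real data) that keeps the colon replacement and moves the test into a predicate on a. It also drops the query string the same way the re version does.

```python
from lxml import html

# Hypothetical two-link sample, shortened from the data in the question.
sample = (
    '<a href="http://www.facebook.com/014media?utm_source=Thalamus.co">'
    '<div class="icon"><svg xmlns="https://www.w3.org/2000/svg">'
    '<use xlink:href="/sprite.svg#facebook"></use></svg></div></a>'
    '<a href="http://www.linkedin.com/company/014-media?utm_source=Thalamus.co">'
    '<div class="icon"><svg xmlns="https://www.w3.org/2000/svg">'
    '<use xlink:href="/sprite.svg#linkedin"></use></svg></div></a>'
)
doc = html.fromstring(sample.replace("xlink:href", "xlinkhref"))

# Expression from the question: searches for @href at or below <use>.
original = doc.xpath('(//a//div//svg//use[contains(@xlinkhref,"linkedin")])//@href')

# Same test moved into a predicate, so @href is read from the matching <a>.
moved = doc.xpath('//a[.//use[contains(@xlinkhref,"linkedin")]]/@href')

print(original)
print(moved[0].split("?")[0])
```

Note that replacing text in the raw markup also rewrites any other occurrence of xlink:href in the page, so the local-name() form is probably the safer choice on real data.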