xpath find link containing HTML in page

Question

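This is not the same question as "xpath find specific link in page". I have <a href="http://example.com">foo <em class="bar">baz</em>.</a> and need to find the link by the full foo <em class="bar">baz</em>. content, including the trailing dot.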

Answer 1

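As far as I know, XPath cannot see the raw HTML markup; it works on an abstract model of the HTML document. Trying to fold as much of the information carried by the markup as possible into an XPath expression gives something like this: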

//a[
    node()[1][self::text() and .='foo ']
    /following-sibling::node()[1][self::em[@class='bar' and .='baz']]
    /following-sibling::node()[1][self::text() and .='.']
]

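A brief explanation of the predicates used:

node()[1][self::text() and .='foo '] : the first child node is a text node whose value equals "foo " (with the trailing space)
/following-sibling::node()[1][self::em[@class='bar' and .='baz']] : immediately followed by an <em> whose class equals "bar" and whose value equals "baz"
/following-sibling::node()[1][self::text() and .='.'] : immediately followed by a text node whose value equals "."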
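
To see what the node-by-node predicates buy over a plain string comparison, here is a minimal sketch using lxml (an assumption; any XPath 1.0 engine should behave the same way). The second link, without the <em>, is a hypothetical look-alike added only for contrast.

```python
from lxml import etree

# First link is the markup from the question; the second is a
# hypothetical look-alike with the same text but no <em>.
DOC = """<div>
<a href="http://example.com">foo <em class="bar">baz</em>.</a>
<a href="http://example.org">foo baz.</a>
</div>"""

# The structural expression: text node, then <em class="bar">, then text node.
STRICT = """//a[
    node()[1][self::text() and .='foo ']
    /following-sibling::node()[1][self::em[@class='bar' and .='baz']]
    /following-sibling::node()[1][self::text() and .='.']
]"""

root = etree.fromstring(DOC)

# Only the link whose children are laid out exactly as described matches.
strict_hrefs = [a.get("href") for a in root.xpath(STRICT)]

# Comparing the string-value alone cannot tell the two links apart.
loose_hrefs = [a.get("href") for a in root.xpath("//a[. = 'foo baz.']")]

print(strict_hrefs)
print(loose_hrefs)
```

The structural version selects only the first link, while the string-value test selects both.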

Answer 2

This is not 100% exact, because calling string() strips out any other HTML tags as well, but for my purposes it looks good enough:

//a[string() = 'foo baz.']/em[@class='bar' and .='baz']
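
A small sketch of why this is "not 100%", assuming lxml is available. The second link, with an extra <i> tag, is a hypothetical input made up for illustration; string() flattens it to the same text, so the expression still matches.

```python
from lxml import etree

XPATH = "//a[string() = 'foo baz.']/em[@class='bar' and .='baz']"

# Markup from the question.
a = etree.fromstring(
    '<a href="http://example.com">foo <em class="bar">baz</em>.</a>'
)
# Hypothetical variant: extra <i> markup around part of the text.
other = etree.fromstring(
    '<a href="http://example.com">f<i>oo</i> <em class="bar">baz</em>.</a>'
)

# string() concatenates all descendant text nodes, dropping the tags.
string_value = a.xpath("string()")
other_string_value = other.xpath("string()")

# Both documents yield the <em> element, although their markup differs.
matches = a.xpath(XPATH)
other_matches = other.xpath(XPATH)

print(string_value, [m.tag for m in matches])
print(other_string_value, [m.tag for m in other_matches])
```

Note that the expression selects the <em> child rather than the <a> itself, because the final location step is em[...].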

Answer 3

Note: I am following up on the OP's comment.

A (visually) simpler variant of the OP's own answer could be:

//a[. = "foo baz."][em[@class = "bar"] = "baz"]

or even:

//a[.="foo baz." and em[@class="bar"]="baz"]

(assuming you want to select the <a> node, not the <em> child)

Regarding the OP's question:

why the [em[]= doesn't need the dot?

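Inside a predicate, an = test against a string on the right-hand side compares the left-hand side by its string-value. Here the <em> is reduced to its string representation, i.e. what string() would return.

The XPath 1.0 specification has an example of this: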

chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to "Introduction"

Later on, the same spec says, regarding boolean tests:

If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true.
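
The "there is a node in the node-set" wording can be checked directly. This is a minimal sketch assuming lxml; the two-<em> input with "baz" and "qux" is a hypothetical example, not taken from the question.

```python
from lxml import etree

# Hypothetical input: two <em class="bar"> children with different text.
a = etree.fromstring(
    '<a>foo <em class="bar">baz</em><em class="bar">qux</em>.</a>'
)

# node-set = string is true if ANY node's string-value equals the string.
any_is_baz = a.xpath('boolean(em[@class="bar"] = "baz")')
any_is_qux = a.xpath('boolean(em[@class="bar"] = "qux")')
any_is_nope = a.xpath('boolean(em[@class="bar"] = "nope")')

# By contrast, string() on a node-set uses only the first node
# in document order.
first_as_string = a.xpath('string(em[@class="bar"])')

print(any_is_baz, any_is_qux, any_is_nope, first_as_string)
```

So the comparison is existential over the node-set, which is different from converting the whole node-set with string() first.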

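In the OP's answer, //a[string() = 'foo baz.']/em[@class='bar' and .='baz'], the . is required because the test against 'baz' is applied to the context node, which at that point is the <em> itself.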

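Note that my answer is somewhat naive and assumes there is only one <em> child in the <a>, because [em[@class="bar"]="baz"] looks for any em[@class="bar"] that satisfies the string-value condition, not the only one or the first one.

Consider this input (a second <em> child, but empty):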

<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.

and this test using a Scrapy Selector:

>>> import scrapy
>>> s = scrapy.Selector(text="""<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.""")
>>> s.xpath('//a[.="foo baz." and em[@class="bar"]="baz"]').extract_first()
u'<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>'
>>>

The XPath matches, but you may not want that.
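
If that extra match is unwanted, one possible tightening (my own assumption, not something from the question) is to also require exactly one <em> child with count(em)=1. A minimal sketch using lxml rather than Scrapy, so that it has fewer dependencies:

```python
from lxml import etree

LOOSE = '//a[.="foo baz." and em[@class="bar"]="baz"]'
# Assumed stricter variant: additionally demand a single <em> child.
STRICT = '//a[.="foo baz." and count(em)=1 and em[@class="bar"]="baz"]'

one_em = etree.fromstring(
    '<a href="http://example.com">foo <em class="bar">baz</em>.</a>'
)
two_em = etree.fromstring(
    '<a href="http://example.com">foo <em class="bar">baz</em>'
    '<em class="bar"></em>.</a>'
)

# Number of <a> elements selected by each expression on each input.
results = {
    "one_em_loose": len(one_em.xpath(LOOSE)),
    "one_em_strict": len(one_em.xpath(STRICT)),
    "two_em_loose": len(two_em.xpath(LOOSE)),
    "two_em_strict": len(two_em.xpath(STRICT)),
}
print(results)
```

The stricter form still accepts the original link and rejects the one with the empty second <em>. It does not check the order of text and element nodes, so the node-by-node expression from Answer 1 remains the more exact test.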
