Xpath - select first occurrence of node with specific type

Question

I am trying to select every first occurrence of a particular node type in the following structure:

<div class="jobs-list">
    <div class="job-listing">
        <h3>Title1</h3>
        <span class="organization">
            <a href="https://www.domain1.org/" target="_blank">Org1</a>
        </span>
        <span class="location">Loc1</span>
        <div class="description">
            desc1
            <a href="https://www.domain1-1.org/" target="_blank">https://www.domain1-1.org/</a>
            <span class="list-date">Posted on: 01/19/2022</span>
        </div>
    </div>
    <div class="job-listing">
        <h3>Title2</h3>
        <span class="organization">
            <a href="https://www.domain2.org/" target="_blank">Org2</a>
        </span>
        <span class="location">Loc2</span>
        <div class="description">
            desc2
            <a href="https://www.domain2.org/" target="_blank">https://www.domain2.org/</a>
            <span class="list-date">Posted on: 01/18/2022</span>
        </div>
    </div>
    <div class="job-listing">
        <h3>Title3</h3>
        <span class="organization">
            <a href="https://www.domain3.org/" target="_blank">Org3</a>
        </span>
        <span class="location">Loc3</span>
        <div class="description">
            desc3            
            <a href="mailto:user@domain3.org">user@domain3.org</a>
            <span class="list-date">Posted on: 01/19/2022</span>
        </div>
    </div>
    <div class="job-listing">
        <h3>Title4</h3>
        <span class="organization">Org4</span>
        <span class="location">Loc4</span>
        <div class="description">
            desc4
            <a href="mailto:user@domain4.org">user@domain4.org</a>
            <a href="https://www.domain4.org/" target="_blank">https://www.domain4.org/</a>
            <a href="https://www.domain4-1.org/" target="_blank">https://www.domain4-1.org/</a>
            <span class="list-date">Posted on: 01/06/2022</span>
        </div>
    </div>
</div>

Specifically, the result I need is:

https://www.domain1.org/
https://www.domain2.org/
https://www.domain3.org/
https://www.domain4.org/

It should be the first a/@href under each div[@class='job-listing'], but I'm not sure how to express that. A few notes:

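The <a> is always two levels below the root (job-listing).
The first <a> is not always the right one (I'm only looking for http links), but I can easily filter those out; what I'm trying to work out is how to select the node, not how to filter on content or anything like that.
I need the value of a/@href, not the content of the <a>.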

Thanks!

Answer 1

//div[@class='job-listing']/descendant::a[1] gives you the first a descendant of each div. If you want to add a check, use e.g. //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1].

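If you need the href attribute node, use //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1]/@href. Note that some default serializations in XSLT or XQuery don't allow you to serialize a sequence of standalone attribute nodes, but in XPath 2 or 3 you can of course use e.g. //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1]/@href/string() to get a sequence of attribute values instead.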
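
To make the semantics of that expression concrete, here is a minimal Python sketch that emulates it with the standard library. This is an illustration, not part of the original answer: xml.etree.ElementTree supports only a subset of XPath (no descendant:: axis, no starts-with()), so the filter and the [1] predicate are written out as a loop. SAMPLE is a trimmed copy of the question's markup (listings 3 and 4 only), and first_http_hrefs is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup (assumption: listings 3 and 4 only).
SAMPLE = """
<div class="jobs-list">
    <div class="job-listing">
        <h3>Title3</h3>
        <span class="organization">
            <a href="https://www.domain3.org/" target="_blank">Org3</a>
        </span>
        <div class="description">
            desc3
            <a href="mailto:user@domain3.org">user@domain3.org</a>
        </div>
    </div>
    <div class="job-listing">
        <h3>Title4</h3>
        <span class="organization">Org4</span>
        <div class="description">
            desc4
            <a href="mailto:user@domain4.org">user@domain4.org</a>
            <a href="https://www.domain4.org/" target="_blank">https://www.domain4.org/</a>
            <a href="https://www.domain4-1.org/" target="_blank">https://www.domain4-1.org/</a>
        </div>
    </div>
</div>
"""


def first_http_hrefs(root):
    """Emulate descendant::a[starts-with(@href, 'http')][1]/@href per job-listing."""
    hrefs = []
    for listing in root.findall(".//div[@class='job-listing']"):
        # iter("a") walks descendants in document order, like the descendant:: axis.
        for a in listing.iter("a"):
            href = a.get("href", "")
            if href.startswith("http"):
                hrefs.append(href)
                break  # the [1] predicate: keep only the first match per listing
    return hrefs


root = ET.fromstring(SAMPLE)
print(first_http_hrefs(root))
# ['https://www.domain3.org/', 'https://www.domain4.org/']
```

With a full XPath 1.0 engine the expression from the answer can be passed in directly; the loop above only spells out what the filter followed by [1] does for each job-listing.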

Answer 2

I'd suggest a more class-based selector:

//span[@class="organization"]//a/@href 
| 
//div[@class="description"][not(preceding-sibling::span/a)]
//a[contains(@href,"http")][1]/@href

This selects the links under organization (A), plus the first http link under description for listings where no A was encountered.

See the live tester link.
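
The two branches of that union can be sketched in Python as follows. This is an approximation under stated assumptions, not the answer's own code: it uses only xml.etree.ElementTree, it assumes the spans and the description div are direct children of each job-listing as in the question's markup, and class_based_hrefs and the trimmed SAMPLE are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup (assumption: listings 3 and 4 only).
SAMPLE = """
<div class="jobs-list">
    <div class="job-listing">
        <h3>Title3</h3>
        <span class="organization">
            <a href="https://www.domain3.org/" target="_blank">Org3</a>
        </span>
        <div class="description">
            desc3
            <a href="mailto:user@domain3.org">user@domain3.org</a>
        </div>
    </div>
    <div class="job-listing">
        <h3>Title4</h3>
        <span class="organization">Org4</span>
        <div class="description">
            desc4
            <a href="mailto:user@domain4.org">user@domain4.org</a>
            <a href="https://www.domain4.org/" target="_blank">https://www.domain4.org/</a>
        </div>
    </div>
</div>
"""


def class_based_hrefs(root):
    hrefs = []
    for listing in root.findall(".//div[@class='job-listing']"):
        span_link_seen = False  # mirrors preceding-sibling::span/a
        for child in listing:
            if child.tag == "span":
                if child.find("a") is not None:
                    span_link_seen = True
                if child.get("class") == "organization":
                    # Branch 1: every link below the organization span.
                    hrefs += [a.get("href") for a in child.iter("a") if a.get("href")]
            elif (child.tag == "div" and child.get("class") == "description"
                    and not span_link_seen):
                # Branch 2: //a[contains(@href, "http")][1] picks, for each
                # parent element, the first <a> child whose href contains "http".
                for parent in child.iter():
                    for a in parent.findall("a"):
                        if "http" in a.get("href", ""):
                            hrefs.append(a.get("href"))
                            break
    return hrefs


print(class_based_hrefs(ET.fromstring(SAMPLE)))
# ['https://www.domain3.org/', 'https://www.domain4.org/']
```

Note that the [1] after //a applies per parent element rather than per job-listing, so this variant depends on the http links in a description being siblings, as they are in the sample.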
