Select all nodes between two elements, excluding an unnecessary element from the intersection, using XPath

Question

There is a document structured as follows:

<div class="document">

    <div class="title">
        <AAA/>
    </div>

    <div class="lead">
        <BBB/>
    </div>

    <div class="photo">
        <CCC/>
    </div>

    <div class="text">
    <!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
        <DDD>
            <EEE/>
            <DDD/>
            <CCC/>
            <FFF/>
                <FFF>
                    <GGG/>
                </FFF>
        </DDD>
    </div>

    <div class="more_text">
        <DDD>
            <EEE/>
            <DDD/>
            <CCC/>
            <FFF/>
                <FFF>
                    <GGG/>
                </FFF>
        </DDD>
    </div>

    <div class="other_stuff">
        <DDD/>
    </div>

</div>

The task is to grab all elements between <div class="lead"> and <div class="other_stuff">, except the <div class="photo"> element.

The Kayessian method for node-set intersection, $ns1[count(.|$ns2) = count($ns2)], works great. After substituting //*[@class="lead"]/following::* for $ns1 and //*[@class="other_stuff"]/preceding::* for $ns2, the working code looks like this:

//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
= count(//*[@class="other_stuff"]/preceding::*)]/text()
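
To make it easier to see what this intersection returns, here is a minimal sketch in plain Python. The standard library's ElementTree does not support the following:: and preceding:: axes, so the helpers `following` and `preceding` below are hypothetical stand-ins that emulate them via document order; the XML is a trimmed-down version of the sample above, not the full document.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the document in the question (an assumption,
# not the full sample).
XML = """
<div class="document">
  <div class="title"><AAA/></div>
  <div class="lead"><BBB/></div>
  <div class="photo"><CCC/></div>
  <div class="text"><DDD><EEE/></DDD></div>
  <div class="other_stuff"><DDD/></div>
</div>
"""

root = ET.fromstring(XML)
order = list(root.iter())                    # every element, in document order
parent = {c: p for p in order for c in p}    # child -> parent map


def following(el):
    """Emulate following::* : later in document order, minus descendants."""
    inside = set(el.iter())                  # el itself plus its descendants
    return [e for e in order[order.index(el) + 1:] if e not in inside]


def preceding(el):
    """Emulate preceding::* : earlier in document order, minus ancestors."""
    ancestors = set()
    node = el
    while node in parent:
        node = parent[node]
        ancestors.add(node)
    return [e for e in order[:order.index(el)] if e not in ancestors]


lead = root.find(".//*[@class='lead']")
other = root.find(".//*[@class='other_stuff']")

ns1 = following(lead)
ns2 = set(preceding(other))
# Same idea as $ns1[count(.|$ns2) = count($ns2)]: keep nodes of ns1 that are in ns2.
between = [e for e in ns1 if e in ns2]

labels = [e.get("class") or e.tag for e in between]
print(labels)  # ['photo', 'CCC', 'text', 'DDD', 'EEE']
```

Note that the result contains both the photo div and its child CCC, because following::* returns descendants of later siblings as separate nodes.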

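It selects everything between <div class="lead"> and <div class="other_stuff">, including the <div class="photo"> element. I tried several ways to insert a not() selector, either into the selection or into the formula itself: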

//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[@class="lead"]/following::*[not(self::class="photo ")]

(and the same with the /preceding::* part), but they don't work. It looks like this not() approach is ignored: the <div class="photo"> element remains in the selection.

Question 1: How to exclude the unnecessary element from this intersection?

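Starting the selection from the <div class="photo"> element, so that it is excluded automatically, is not an option, because in other documents it can appear in any position or not appear at all.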

Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?

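Initially it selects everything up to the end and back to the beginning of the whole document. Would it be better to specify an exact endpoint for the following:: and preceding:: axes? I tried //*[@class="lead"]/following::[@class="other_stuff"] but it doesn't seem to work.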

Answer 1

Question 1: How to exclude the unnecessary element from this intersection?

Add another predicate to your working XPath, in this case [not(self::div[@class='photo'])]. For this particular case, the whole XPath looks like this (formatted for readability):

//*[@class="lead"]
 /following::*[
    count(. | //*[@class="other_stuff"]/preceding::*) 
        = 
    count(//*[@class="other_stuff"]/preceding::*)
 ][not(self::div[@class='photo'])]
/text()
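
One caveat worth checking against your real documents: self:: only tests the node itself, and following::* also returns the descendants of the photo div as separate nodes, so those would still pass the filter. If they should go as well, a predicate such as [not(ancestor-or-self::div[@class='photo'])] is one option. The sketch below contrasts the two filters in plain Python on a trimmed-down sample; `is_photo` and `ancestors_or_self` are hypothetical helpers, and the axes are emulated because ElementTree does not support them.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the sample document (an assumption).
XML = """
<div class="document">
  <div class="title"><AAA/></div>
  <div class="lead"><BBB/></div>
  <div class="photo"><CCC/></div>
  <div class="text"><DDD><EEE/></DDD></div>
  <div class="other_stuff"><DDD/></div>
</div>
"""

root = ET.fromstring(XML)
order = list(root.iter())                    # every element, in document order
parent = {c: p for p in order for c in p}    # child -> parent map


def ancestors_or_self(el):
    """Emulate ancestor-or-self::* for one element."""
    chain = [el]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def is_photo(el):
    """Emulate the test self::div[@class='photo']."""
    return el.tag == "div" and el.get("class") == "photo"


lead = root.find(".//*[@class='lead']")
other = root.find(".//*[@class='other_stuff']")

# Shortcut for the intersection: slice document order between the two anchors
# and drop lead's descendants. This is only valid here because both anchors
# are siblings; it is not a general replacement for the count() idiom.
inside_lead = set(lead.iter())
between = [e for e in order[order.index(lead) + 1:order.index(other)]
           if e not in inside_lead]

# [not(self::div[@class='photo'])]: drops the div, keeps its children.
self_only = [e for e in between if not is_photo(e)]

# [not(ancestor-or-self::div[@class='photo'])]: drops the whole subtree.
with_subtree = [e for e in between
                if not any(is_photo(a) for a in ancestors_or_self(e))]


def label(e):
    return e.get("class") or e.tag


print([label(e) for e in self_only])     # ['CCC', 'text', 'DDD', 'EEE']
print([label(e) for e in with_subtree])  # ['text', 'DDD', 'EEE']
```

Which variant fits depends on whether the content nested inside the photo div should survive; the self:: form from the answer is enough if only the div's own text nodes need to be dropped.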

Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?

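I'm not sure whether it would be 'better'. What I can say is that following::[@class="other_stuff"] is an invalid expression: an axis must be followed by a node test that the predicate then applies to, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].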

xpath

scrapy-spider