XPath: getting all elements without specific @class or @id name

Question

I'm getting frustrated. I have tried many variations and searched every existing Stack Overflow question for an answer, but nothing helped.

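I just need to get all the text, except text that belongs to elements with the @class name 'menu' or the @id name 'menu'. This is the expression I have tried: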

//*[not(descendant-or-self::*[(contains(@id, 'menu')) or (contains(@class, 'menu'))])]/text()[normalize-space()]

But whatever I try, I always get back all the text, including the text from the elements I meant to exclude.

PS: I'm using Scrapy, which uses XPath 1.0.

<body>
  <div id="top">
    <div class="topHeader">
      <div class="topHeaderContent">
        <a class="headerLogo" href="/Site/Home.de.html"></a>
        <a class="headerText" href="/Site/Home.de.html"></a>
        <div id="menuSwitch"></div>
      </div>
    </div>

    <div class="topContent">
      <div id="menuWrapper">
        <nav>
          <ul class="" id="menu"><li class="firstChild"><a class="topItem" href="/Site/Home.de.html">Home</a>     </li>
            <li class="hasChild"><span class="topItem">Produkte</span><ul class=" menuItems"><li class=""><a href="/Site/Managed_Services.de.html">Managed Services</a>             </li>
              <li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a>                </li>
              <li class=""><a href="/Site/DMB/Apps.de.html">Mobile Publishing</a>             </li>
              <li class=""><a href="/Site/Broadcasting.de.html">Broadcasting</a>              </li>
              <li class=""><a href="/Site/Content_Management.de.html">Content Management</a>      </li>
            </ul>
          </li>
          <li class="hasChild"><span class="topItem">Digital Media Base</span><ul class=" menuItems"><li class=""><a href="/Site.de.html">About DMB</a>           </li>
            <li class=""><a href="/Site/DMB/Quellen.de.html">Quellen</a>            </li>
            <li class=""><a href="/Site/DMB/Video.de.html">Video</a>                </li>
            <li class=""><a href="/Site/DMB/Apps.de.html">Apps</a>          </li>
            <li class=""><a href="/Site/DMB/Web.de.html">Web</a>            </li>
            <li class=""><a href="/Site/DMB/Archiv.de.html">Archiv</a>              </li>
            <li class=""><a href="/Site/DMB/Social_Media.de.html">Social Media</a>          </li>
            <li class=""><a href="/Site/DMB/statistik.de.html">Statistik</a>                </li>
            <li class=""><a href="/Site/DMB/Payment.de.html">Payment</a>            </li>
          </ul>
        </li>
        <li class="activeMenu "><a class="topItem" href="/Site/Karriere.de.html">Karriere</a>           </li>
        <li class="hasChild"><span class="topItem">Fake-IT</span><ul class=" menuItems"><li class=""><a href="/Site/About.de.html">About</a>             </li>
          <li class=""><a href="/Site/Management.de.html">Management</a>          </li>
          <li class=""><a href="/Site/Mission_Statement.de.html">Mission Statement</a>        </li>
          <li class=""><a href="/Site/Pressemeldungen.de.html">Pressemeldungen</a>            </li>
          <li class=""><a href="/Site/Referenzen.de.html">Kunden</a>              </li>
        </ul>
      </li>
    </ul>
  </nav>
  <div class="topSearch">
    <div class="topSearch">
      <form action="/Site/Suchergebnis.html" method="get">
        <form action="/Site/Suchergebnis.html" method="get">
          <input class="searchText" onblur="processSearch(this, &quot;Suchbegriff&quot;, &quot;blur&quot;)" onfocus="processSearch(this,&quot;Suchbegriff&quot;)" type="text" value="Suchbegriff" name="searchTerm" id="searchTerm" />
          <input class="searchSubmit" id="js_searchSubmit" type="submit" name="yt0" />
          <div class="stopFloat">
          </div>
        </form>
      </div>
    </div>
  </div>
</div>
<p> I want to have this text here! </p>
.
.
More elements
.
.
</div>
<p> I want to have this text here! </p>
.
.
More elements
.
.
</body>

I always get this:

['Home',
 'Produkte',
 'Managed Services',
 'VideoServices',
 'Mobile Publishing',
 'Broadcasting',
 'Content Management',
 'Digital Media Base',
 'About DMB',
 'Quellen',
 'Video',
 'Apps',
 'Web',
 'Archiv',
 'Social Media',
 'Statistik',
 'Payment',
 'Karriere',
 'Fake-IT',
 'About',
 'Management',
 'Mission Statement',
 'Pressemeldungen',
 'Kunden',
 ' I want to have this text here! ',
 ' I want to have this text here! ']

But I need this:

[' I want to have this text here! ',
 ' I want to have this text here! ']

Answer 1

This rather complicated XPath 1.0 expression works on your sample HTML. It would be somewhat simpler in XPath 2.0 and later. Please try it on your actual code, though:

 //*[not(descendant-or-self::*[contains(@class,'menu')])]
 [not(descendant-or-self::*[contains(@id,'menu')])]
 [not(ancestor-or-self::*[contains(@class,'menu')])]
 [not(ancestor-or-self::*[contains(@id,'menu')])]//text()

Answer 2

Well, consider the element

<li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a> </li>

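It is the descendant-or-self of something, and neither it nor any of its descendants has a relevant id or class attribute, so your not(descendant-or-self::...) predicate is true for it and of course it gets selected.

Perhaps you want //*[not(ancestor-or-self::*[@id='menu' or @class='menu'])]

You wrote "contains", but I'm not sure you really meant it. Many people use contains() when they should be using "=".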
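
To make the difference between the two axes concrete, here is a minimal sketch in plain Python. It uses the standard library's xml.etree.ElementTree on a cut-down copy of the question's markup and emulates the two predicates by hand, because ElementTree's limited XPath subset has neither not() nor the ancestor axis. The helper names (is_menu, menu_in_descendant_or_self, menu_in_ancestor_or_self) and the shortened SAMPLE are assumptions made for illustration, not part of the original page or of Scrapy.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the question's HTML (an assumption, not the full page).
SAMPLE = """<body>
  <ul class="" id="menu">
    <li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a></li>
  </ul>
  <p> I want to have this text here! </p>
</body>"""


def is_menu(el):
    # Mirrors contains(@id, 'menu') or contains(@class, 'menu').
    return "menu" in el.get("id", "") or "menu" in el.get("class", "")


root = ET.fromstring(SAMPLE)
# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def menu_in_descendant_or_self(el):
    # el.iter() walks el itself and everything below it.
    return any(is_menu(node) for node in el.iter())


def menu_in_ancestor_or_self(el):
    # Walk from el up to the root, checking each element on the way.
    node = el
    while node is not None:
        if is_menu(node):
            return True
        node = parent_of.get(node)
    return False


li = root.find(".//li")
print(menu_in_descendant_or_self(li))  # False: the question's predicate keeps this <li>
print(menu_in_ancestor_or_self(li))    # True: the ancestor check rejects it via ul#menu
```

The <li> only looks downward in the first check, so the id="menu" on its parent <ul> is never seen; looking upward is what excludes it.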

Answer 3

You can iterate over the tags directly through Scrapy's lxml tree, as in this code sample:

result = []
for tag in response.css("*"):
    if 'id' not in tag.attrib and 'class' not in tag.attrib and 'href' not in tag.attrib:
        text = tag.css("::text").extract_first("").strip("\n ")
        if text:
            result.append(tag.css("::text").extract_first())

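As you can see, I also excluded tags that have an href attribute, because <a> tags like this one:
<a href="/Site/DMB/Video.de.html">VideoServices</a> have no class or id attribute, so technically they do not violate your XPath expression.
So if you intend to use an XPath selector, you also need to exclude tags that have an href attribute.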
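
A variation on the same iteration idea is to skip the whole subtree below any element whose id or class contains 'menu', instead of testing each tag's own attributes. The sketch below shows this with the standard library's xml.etree.ElementTree rather than a Scrapy response, so it can be run on its own. The names is_menu and texts_outside_menu and the shortened SAMPLE are hypothetical. In XPath terms this roughly corresponds to keeping text nodes that satisfy not(ancestor::*[contains(@id,'menu') or contains(@class,'menu')]).

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the page in the question (an assumption, not the full markup).
SAMPLE = """<body>
  <div id="top">
    <div id="menuWrapper">
      <ul class="" id="menu">
        <li class="firstChild"><a class="topItem" href="/Site/Home.de.html">Home</a></li>
        <li class="hasChild"><span class="topItem">Produkte</span>
          <ul class=" menuItems"><li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a></li></ul>
        </li>
      </ul>
    </div>
    <p> I want to have this text here! </p>
  </div>
  <p> I want to have this text here! </p>
</body>"""


def is_menu(el):
    """Substring match on id or class, like contains() in the question."""
    return "menu" in el.get("id", "") or "menu" in el.get("class", "")


def texts_outside_menu(el):
    """Yield non-blank text, skipping every subtree rooted at a menu element."""
    if is_menu(el):
        return
    if el.text and el.text.strip():
        yield el.text
    for child in el:
        yield from texts_outside_menu(child)
        # Tail text follows the child but belongs to the current element.
        if child.tail and child.tail.strip():
            yield child.tail


result = list(texts_outside_menu(ET.fromstring(SAMPLE)))
print(result)
# [' I want to have this text here! ', ' I want to have this text here! ']
```

Because the walk stops at the first menu element it meets, links without their own id or class no longer need a separate href check.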
