XPath: getting all elements without specific @class or @id name
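I'm getting frustrated. I have tried many variations and searched all the existing Stack Overflow questions for an answer, but nothing helped.
I just need all of the text, excluding elements with the @class name 'menu' or the @id name 'menu'.
I have already tried this expression: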
//*[not(descendant-or-self::*[(contains(@id, 'menu')) or (contains(@class, 'menu'))])]/text()[normalize-space()]
But whatever I try, I always get all of the text back, including the text of the elements I meant to exclude.
PS: I'm using Scrapy, which uses XPath 1.0.
<body>
<div id="top">
<div class="topHeader">
<div class="topHeaderContent">
<a class="headerLogo" href="/Site/Home.de.html"></a>
<a class="headerText" href="/Site/Home.de.html"></a>
<div id="menuSwitch"></div>
</div>
</div>
<div class="topContent">
<div id="menuWrapper">
<nav>
<ul class="" id="menu"><li class="firstChild"><a class="topItem" href="/Site/Home.de.html">Home</a> </li>
<li class="hasChild"><span class="topItem">Produkte</span><ul class=" menuItems"><li class=""><a href="/Site/Managed_Services.de.html">Managed Services</a> </li>
<li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a> </li>
<li class=""><a href="/Site/DMB/Apps.de.html">Mobile Publishing</a> </li>
<li class=""><a href="/Site/Broadcasting.de.html">Broadcasting</a> </li>
<li class=""><a href="/Site/Content_Management.de.html">Content Management</a> </li>
</ul>
</li>
<li class="hasChild"><span class="topItem">Digital Media Base</span><ul class=" menuItems"><li class=""><a href="/Site.de.html">About DMB</a> </li>
<li class=""><a href="/Site/DMB/Quellen.de.html">Quellen</a> </li>
<li class=""><a href="/Site/DMB/Video.de.html">Video</a> </li>
<li class=""><a href="/Site/DMB/Apps.de.html">Apps</a> </li>
<li class=""><a href="/Site/DMB/Web.de.html">Web</a> </li>
<li class=""><a href="/Site/DMB/Archiv.de.html">Archiv</a> </li>
<li class=""><a href="/Site/DMB/Social_Media.de.html">Social Media</a> </li>
<li class=""><a href="/Site/DMB/statistik.de.html">Statistik</a> </li>
<li class=""><a href="/Site/DMB/Payment.de.html">Payment</a> </li>
</ul>
</li>
<li class="activeMenu "><a class="topItem" href="/Site/Karriere.de.html">Karriere</a> </li>
<li class="hasChild"><span class="topItem">Fake-IT</span><ul class=" menuItems"><li class=""><a href="/Site/About.de.html">About</a> </li>
<li class=""><a href="/Site/Management.de.html">Management</a> </li>
<li class=""><a href="/Site/Mission_Statement.de.html">Mission Statement</a> </li>
<li class=""><a href="/Site/Pressemeldungen.de.html">Pressemeldungen</a> </li>
<li class=""><a href="/Site/Referenzen.de.html">Kunden</a> </li>
</ul>
</li>
</ul>
</nav>
<div class="topSearch">
<div class="topSearch">
<form action="/Site/Suchergebnis.html" method="get">
<form action="/Site/Suchergebnis.html" method="get">
<input class="searchText" onblur="processSearch(this, "Suchbegriff", "blur")" onfocus="processSearch(this,"Suchbegriff")" type="text" value="Suchbegriff" name="searchTerm" id="searchTerm" />
<input class="searchSubmit" id="js_searchSubmit" type="submit" name="yt0" />
<div class="stopFloat">
</div>
</form>
</div>
</div>
</div>
</div>
<p> I want to have this text here! </p>
.
.
More elements
.
.
</div>
<p> I want to have this text here! </p>
.
.
More elements
.
.
</body>
I always get this:
['Home',
'Produkte',
'Managed Services',
'VideoServices',
'Mobile Publishing',
'Broadcasting',
'Content Management',
'Digital Media Base',
'About DMB',
'Quellen',
'Video',
'Apps',
'Web',
'Archiv',
'Social Media',
'Statistik',
'Payment',
'Karriere',
'Fake-IT',
'About',
'Management',
'Mission Statement',
'Pressemeldungen',
'Kunden',
' I want to have this text here! ',
' I want to have this text here! ']
But I need this:
[' I want to have this text here! ',
' I want to have this text here! ']
This rather complex XPath 1.0 expression works on your sample HTML. It would be somewhat simpler in XPath 2.0 and later. Please try it on your actual page, though:
//*[not(descendant-or-self::*[contains(@class,'menu')])]
[not(descendant-or-self::*[contains(@id,'menu')])]
[not(ancestor-or-self::*[contains(@class,'menu')])]
[not(ancestor-or-self::*[contains(@id,'menu')])]//text()
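For anyone who wants to check the expression outside a spider, here is a minimal sketch using lxml directly. The HTML is a reduced version of the sample from the question (my own trimming, not the full page), and whitespace-only text nodes are filtered in Python. In Scrapy the same expression string should be usable with response.xpath(...).getall().

```python
from lxml import etree

# Reduced version of the sample page (assumption: enough to show the effect).
HTML = """
<body>
<div id="top">
<div id="menuWrapper"><nav>
<ul class="" id="menu"><li class=""><a href="/Site/Home.de.html">Home</a> </li></ul>
</nav></div>
<p> I want to have this text here! </p>
</div>
<p> I want to have this text here! </p>
</body>
"""

# The expression from the answer, unchanged apart from layout.
EXPR = """
//*[not(descendant-or-self::*[contains(@class,'menu')])]
   [not(descendant-or-self::*[contains(@id,'menu')])]
   [not(ancestor-or-self::*[contains(@class,'menu')])]
   [not(ancestor-or-self::*[contains(@id,'menu')])]//text()
""".strip()

root = etree.HTML(HTML)

# Drop whitespace-only text nodes; keep the rest exactly as found.
texts = [str(t) for t in root.xpath(EXPR) if t.strip()]
print(texts)
```

Note that the descendant-or-self predicates also drop every element that merely contains a menu somewhere below it (here body and div#top), so text sitting directly inside those containers is not returned.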
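Well, consider the element
<li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a> </li>
Neither it nor any of its descendants has a matching id or class attribute, so of course it gets selected.
Perhaps you want //*[not(ancestor-or-self::*[@id='menu' or @class='menu'])]
You wrote "contains", but I'm not sure you really meant it. Many people use contains() when they should use "=".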
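To make the difference between "=" and contains() concrete, here is a small sketch with lxml. The mini-document and the helper kept_texts are made up for illustration; the third predicate is the usual XPath 1.0 idiom for matching a single whitespace-separated class token, which is my addition and not part of the answer above.

```python
from lxml import etree

# Hypothetical mini-document: three different ways an element can be "menu".
HTML = """
<div>
  <ul id="menu"><li>exact id</li></ul>
  <ul class=" menuItems"><li>substring class</li></ul>
  <ul class="main menu"><li>class token</li></ul>
  <p>plain</p>
</div>
"""
root = etree.HTML(HTML)


def kept_texts(predicate):
    """Text of <li>/<p> elements with no ancestor-or-self matching predicate."""
    expr = f"//*[self::li or self::p][not(ancestor-or-self::*[{predicate}])]/text()"
    return [str(t) for t in root.xpath(expr)]


# Exact comparison: only id="menu" or class="menu" (whole attribute value).
EQUALS = "@id='menu' or @class='menu'"
# Substring comparison: also hits "menuItems", "menuWrapper", and so on.
CONTAINS = "contains(@id,'menu') or contains(@class,'menu')"
# Token comparison: "menu" as one of several space-separated class names.
TOKEN = "@id='menu' or contains(concat(' ', normalize-space(@class), ' '), ' menu ')"

print(kept_texts(EQUALS))    # ['substring class', 'class token', 'plain']
print(kept_texts(CONTAINS))  # ['plain']
print(kept_texts(TOKEN))     # ['substring class', 'plain']
```

Which of the three is right depends on the real page: the sample HTML has ids such as menuWrapper and classes such as " menuItems", which only the contains() variant excludes.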
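You can iterate over the tags directly via Scrapy's lxml tree, as in this code example:
result = []
for tag in response.css("*"):
    # Keep only tags that carry none of the id, class, or href attributes.
    if 'id' not in tag.attrib and 'class' not in tag.attrib and 'href' not in tag.attrib:
        # Strip newlines and spaces just to test whether any real text is present.
        text = tag.css("::text").extract_first("").strip("\n ")
        if text:
            result.append(tag.css("::text").extract_first())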
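As you can see, I also excluded tags that have an href attribute, because <a> tags like this one:
<a href="/Site/DMB/Video.de.html">VideoServices</a>
have no class or id attributes, so technically they do not violate your XPath expression.
So if you plan to use XPath selectors, you will also need to exclude tags that have an href attribute.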
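If you would rather stay with XPath, a rough equivalent of the loop might look like the sketch below. It uses lxml on a trimmed copy of the sample HTML (my own reduction), and the expression is my reading of the same rule, not something tested on the real site.

```python
from lxml import etree

# Trimmed copy of the sample page (assumption: enough to show the effect).
HTML = """
<body>
<div id="top">
<ul class="" id="menu"><li class=""><a href="/Site/Home.de.html">Home</a> </li></ul>
<p> I want to have this text here! </p>
</div>
<p> I want to have this text here! </p>
</body>
"""

# Elements with no id, class, or href attribute; keep their non-blank text children.
EXPR = "//*[not(@id) and not(@class) and not(@href)]/text()[normalize-space()]"

root = etree.HTML(HTML)
texts = [str(t) for t in root.xpath(EXPR)]
print(texts)
```

Be aware that this rule is purely attribute-based: on the full page, bare text sitting directly inside an attribute-less element such as <body> would be returned as well.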