How to select text from multiple elements with different HTML structure using XPath
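I have this div, and I would like to ask whether it is possible to select every "TEXT_I_NEED_X" value with a single XPath expression.
The closest I have come to selecting them all is the following, but it selects more than I need (it also returns the link texts):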
//div[@class="article-text-with-img"]/p//text()
<div class="article-text-with-img">
<p>
<a href="#"> Text1 </a>
</p>
<p> </p>
<p>
TEXT_I_NEED_A
<a href="#"> Text2 </a>
</p>
<p>
<span>
TEXT_I_NEED_B
<a href="#"> Text3 </a>
</span>
</p>
<p>
<span>
<span>
TEXT_I_NEED_C
<a href="#"> Text4 </a>
</span>
</span>
</p>
<p>
<span>
TEXT_I_NEED_D
</span>
<a href="#"> Text5 </a>
</p>
<p>
<span>
<span>
TEXT_I_NEED_D
</span>
<a href="#"> Text5 </a>
</span>
</p>
</div>
An example using beautifulsoup:
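from bs4 import BeautifulSoup

# Paste the HTML snippet from the question between the triple quotes.
html_doc = """<YOUR HTML SNIPPET FROM THE QUESTION>"""
soup = BeautifulSoup(html_doc, "html.parser")
article = soup.select_one(".article-text-with-img")
# Remove every <a> element so that the link texts are not collected.
for a in article.select("a"):
    a.extract()
# Keep only the text nodes that are non-empty after stripping whitespace.
text = [t for node in article.find_all(string=True) if (t := node.strip())]
print(text)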
Prints:
['TEXT_I_NEED_A', 'TEXT_I_NEED_B', 'TEXT_I_NEED_C', 'TEXT_I_NEED_D', 'TEXT_I_NEED_D']
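If installing beautifulsoup is not an option, the same idea (ignore everything inside `<a>`, keep the remaining non-empty text) can be sketched with only the standard library. This is a minimal sketch, not part of the original answer: the class name `TextOutsideLinks` and the compacted `sample` markup are assumptions made for illustration, and unlike the snippet above it does not restrict itself to the `article-text-with-img` div.

```python
from html.parser import HTMLParser


class TextOutsideLinks(HTMLParser):
    """Collect non-empty text that is not nested inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0  # > 0 while the parser is inside an <a> element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        # Skip whitespace-only nodes and anything that belongs to a link.
        if stripped and self.link_depth == 0:
            self.texts.append(stripped)


# Hypothetical, compacted version of the markup from the question.
sample = """<div class="article-text-with-img">
<p><a href="#"> Text1 </a></p>
<p> </p>
<p>TEXT_I_NEED_A <a href="#"> Text2 </a></p>
<p><span>TEXT_I_NEED_B <a href="#"> Text3 </a></span></p>
<p><span><span>TEXT_I_NEED_C <a href="#"> Text4 </a></span></span></p>
<p><span>TEXT_I_NEED_D</span> <a href="#"> Text5 </a></p>
</div>"""

parser = TextOutsideLinks()
parser.feed(sample)
parser.close()
print(parser.texts)
# ['TEXT_I_NEED_A', 'TEXT_I_NEED_B', 'TEXT_I_NEED_C', 'TEXT_I_NEED_D']
```

Tracking a depth counter rather than a boolean keeps the sketch correct even if the markup contains nested anchors.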
Using a single XPath expression:
//div[@class="article-text-with-img"]//a/parent::*/text() | //div[@class="article-text-with-img"]//a/preceding-sibling::span/text()
Using xmllint on the command line (newlines and whitespace are included in text()):
xmllint --html --xpath '//div[@class="article-text-with-img"]//a/parent::*/text() | //div[@class="article-text-with-img"]//a/preceding-sibling::span/text()' test.html
TEXT_I_NEED_A
TEXT_I_NEED_B
TEXT_I_NEED_C
TEXT_I_NEED_D
TEXT_I_NEED_E
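The same expression can also be evaluated from Python. The following is a minimal sketch, assuming the third-party lxml package is installed. The `SAMPLE` markup is a compacted, hypothetical version of the question's HTML, and the names `SAMPLE`, `XPATH` and `wanted` are illustrative only. Because text() also returns whitespace-only nodes, the results are stripped and the empty ones are dropped.

```python
from lxml import html  # third-party package: pip install lxml

# Hypothetical, compacted version of the markup from the question.
SAMPLE = """<div class="article-text-with-img">
<p><a href="#"> Text1 </a></p>
<p> </p>
<p>TEXT_I_NEED_A <a href="#"> Text2 </a></p>
<p><span>TEXT_I_NEED_B <a href="#"> Text3 </a></span></p>
<p><span><span>TEXT_I_NEED_C <a href="#"> Text4 </a></span></span></p>
<p><span>TEXT_I_NEED_D</span> <a href="#"> Text5 </a></p>
</div>"""

# Union of two node sets: text that is a direct child of an <a>'s parent,
# and text inside a <span> that is a preceding sibling of an <a>.
XPATH = (
    '//div[@class="article-text-with-img"]//a/parent::*/text()'
    ' | //div[@class="article-text-with-img"]//a/preceding-sibling::span/text()'
)

root = html.fromstring(SAMPLE)
raw = root.xpath(XPATH)  # list of strings, still padded with whitespace

# Strip each text node and discard the ones that were only whitespace.
wanted = [t.strip() for t in raw if t.strip()]
print(wanted)
```

On this sample the list should contain the four TEXT_I_NEED_* strings in document order, with none of the link texts.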