XPath: Everything between two elements in HTML

Question

I am trying to write an XPath that gets the text of every node between two nodes, but it is a bit tricky. The structure looks something like this:

<table class="infobox biography vcard">
<p>
  <b>Mark Hamill</b>
  <span> was born in ...</span>
  <i> in the early 1980s he...</i>
  <a href="/some/place">Luke Skywalker</a>
</p>
<div class="toc">

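I need all the text in the paragraph that sits exactly between the table and the div. This is my current XPath query:

//table[@class="infobox biography vcard"]/following-sibling::node()[following-sibling::div[@class="toc"]]/text()

It does not get all the text from some of the tags. How can I achieve this? Note: the p tag has no attributes, but there are some other p tags in the document.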

Answer 1

Edited:

The wiki page has 2 p elements between the table and the div, like this:

<table class="infobox biography vcard">
<!-- Content of table -->
</table>

<p>Here all kinds of mixed content</p>
<p>Here other kinds of mixed content</p>

<div class="toc">
<!-- Content of toc -->
</div>

To get the text of the p elements between the table and the div, use this XPath:

normalize-space(//p[preceding-sibling::table[@class='infobox biography vcard'] and following-sibling::div[@class='toc']])

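The normalize-space() call is only there to match your "expected output is all the text between the P tags". Note that in XPath 1.0 a string function applied to a node-set uses only the first node, so this expression returns the text of the first matching p. Depending on your needs you can also use: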

//p[preceding-sibling::table[@class='infobox biography vcard'] and following-sibling::div[@class='toc']]//text()
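To see how the two expressions differ, here is a minimal sketch using lxml. The sample markup and the variable names are made up for illustration, and the predicate is written with `following-sibling::div`, since `*div` is not valid XPath.

```python
from lxml import html

# Hypothetical markup mirroring the table / p / p / div layout described above.
SAMPLE = """
<div>
  <table class="infobox biography vcard"><tr><td>Table content</td></tr></table>
  <p>First   <b>paragraph</b></p>
  <p>Second paragraph</p>
  <div class="toc">Contents</div>
</div>
"""

# The p elements that have the table before them and the toc div after them.
P_BETWEEN = (
    "//p[preceding-sibling::table[@class='infobox biography vcard'] "
    "and following-sibling::div[@class='toc']]"
)

tree = html.fromstring(SAMPLE)

# normalize-space() on a node-set only looks at the first node (XPath 1.0).
first_only = tree.xpath(f"normalize-space({P_BETWEEN})")

# //text() returns every descendant text node of every matching p.
text_nodes = tree.xpath(f"{P_BETWEEN}//text()")
all_text = " ".join(" ".join(text_nodes).split())

print(first_only)
print(all_text)
```

Joining the text nodes with a space and collapsing whitespace is one possible choice; it works for this sample but may insert a space where a tag splits a word.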

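If you want full control over how the content of those p elements is used, you can start by selecting the p elements themselves:

//p[preceding-sibling::table[@class='infobox biography vcard'] and following-sibling::div[@class='toc']]

Then loop over the mixed content of each p, possibly with XPath again, to extract what you need.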
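As a rough sketch of that loop in Python, assuming lxml is available: the sample markup and the helper name `paragraphs_between` are hypothetical, and `text_content()` is just one way to flatten the mixed content of each p.

```python
from lxml import html

# Hypothetical markup based on the structure in the question.
SAMPLE = """
<div>
  <table class="infobox biography vcard"><tr><td>Table content</td></tr></table>
  <p><b>Mark Hamill</b><span> was born in ...</span><i> in the early 1980s he...</i>
     <a href="/some/place">Luke Skywalker</a></p>
  <p>Second   paragraph</p>
  <div class="toc">Contents</div>
  <p>Outside the range</p>
</div>
"""

PARAGRAPHS_XPATH = (
    "//p[preceding-sibling::table[@class='infobox biography vcard'] "
    "and following-sibling::div[@class='toc']]"
)


def paragraphs_between(tree):
    """Return the whitespace-normalized text of each p between table and div."""
    result = []
    for p in tree.xpath(PARAGRAPHS_XPATH):
        # text_content() concatenates the text of the element and all descendants.
        result.append(" ".join(p.text_content().split()))
    return result


tree = html.fromstring(SAMPLE)
texts = paragraphs_between(tree)
for text in texts:
    print(text)
```

Inside the loop you could instead call `p.xpath(...)` with a relative expression if you only want certain children, such as the links.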

Answer 2

If you want the biography, you can use bs4 with the adjacent sibling combinator to get the 2 paragraphs next to the table. It can be faster than XPath.

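import requests
from bs4 import BeautifulSoup as bs

# Fetch the page and parse it with the lxml parser
r = requests.get('https://en.wikipedia.org/wiki/Mark_Hamill')
soup = bs(r.content, 'lxml')
# '+' is the adjacent sibling combinator: the first p right after the
# .biography table, and the p right after that one
print(' '.join([i.text.strip() for i in soup.select('.biography + p, .biography + p + p')]))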
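The selector above is tied to exactly two paragraphs. If the number of paragraphs is not known in advance, one possible variation is to walk the table's siblings until the toc div is reached. This is only a sketch: the sample markup and the function name `text_between` are invented for illustration, and it uses the built-in `html.parser` so it runs offline.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup mirroring the table / p / p / div layout.
SAMPLE = """
<div>
  <table class="infobox biography vcard"><tr><td>Table content</td></tr></table>
  <p><b>Mark Hamill</b> was born in ...</p>
  <p>In the early 1980s he...</p>
  <div class="toc">Contents</div>
  <p>Outside the range</p>
</div>
"""


def text_between(soup):
    """Collect the text of every p after the infobox table, up to the toc div."""
    start = soup.select_one("table.infobox.biography.vcard")
    texts = []
    for sibling in start.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace and other bare strings
        if sibling.name == "div" and "toc" in sibling.get("class", []):
            break  # reached the end marker
        if sibling.name == "p":
            texts.append(" ".join(sibling.get_text().split()))
    return texts


soup = BeautifulSoup(SAMPLE, "html.parser")
between = text_between(soup)
print(" ".join(between))
```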
