Delineating divs of large chunks of text with XPath (or other)

Question

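Given a page such as this, with two jobs (let's ignore 'Open applications' for now) described in full one after the other, I'm looking for a reliable way to extract the two job specs, each of which sits inside a div. The first goal is to extract the specs, then ideally to wrap them in some enclosing HTML tags so that they render in a browser when saved as an HTML file.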

Obviously, if I knew in advance that the top-level div's class name was "jobitem", I could run a simple XPath that selects on that class.

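However, there will be several such sites (very different in design, but all listing full job specs one after the other), and my program won't have the luxury of knowing such class names in advance. My program will know one thing: the absolute and relative locations of the job titles (<h2>, <h3>, etc.). In other words, I will be running a query such as: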

//*[self::h2 or self::h3 or self::h4][contains(., 'Country Manager')]

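... resulting in an array of Python lxml XPath objects, from which relative XPaths can then be performed. Perhaps this knowledge is a starting point for grabbing all text in between each heading?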
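
For illustration, here is a minimal sketch of that heading query in lxml. The SAMPLE markup and the job titles in it are made up, since the real page is not reproduced here; only the XPath expression comes from the question.

```python
from lxml import html

# Hypothetical markup standing in for the real page (an assumption).
SAMPLE = """
<html><body>
  <div><h2>Country Manager, Norway</h2><div><p>First spec.</p></div></div>
  <div><h3>Country Manager, Sweden</h3><div><p>Second spec.</p></div></div>
  <div><h2>Open applications</h2><div><p>Ignored.</p></div></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The query from the question: any h2/h3/h4 whose text contains the job title.
headings = tree.xpath(
    "//*[self::h2 or self::h3 or self::h4][contains(., 'Country Manager')]"
)

# Each result is an lxml element, so further relative XPaths can start from it.
titles = [h.text_content().strip() for h in headings]
tags = [h.tag for h in headings]
```

Each item in headings is an ordinary lxml element, so calling .xpath() on it evaluates the expression relative to that heading.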

Answer 1

"... resulting in an array of Python lxml XPath objects, from which relative XPaths can then be performed. Perhaps this knowledge is a starting point for grabbing all text in between each heading?"

Sure (if I understand correctly), at that point the task is simple: use the following-sibling axis in a relative XPath:

following-sibling::div

html

python

xhtml

xpath

lxml