XPath issues selecting <span> elements nested in <td>

Question

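I'm trying to extract text from a large number of XHTML documents using a program that uses XPath queries to map the text into a structured table. The XHTML documents look like this: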

<td class="td-3 c12" valign="top">
 <p class="pa-4">
  <span class="ca-5">text I would like to select </span>
 </p>
</td>
<td class="td-3 c13" valign="top">
 <p class="pa-2">
  <span class="ca-0">some more text I want to select </span>
 </p>
 <p class="pa-2">
  <span class="ca-0">
 <br>
 </br>
  </span>
 </p>
 <p class="pa-2">
 <span class="ca-5">text and values I don't want to select.</span>
 </p>
 <p class="pa-2">
  <span class="ca-5"> also text and values I don't want to </span>
 </p>
</td>

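I can select the spans by class and retrieve the text/values, but the span classes are not unique enough, so I also need to filter by the classes of the table cell. For example, I want only the text from spans with class ca-0 that are children of a td with class td-3 c13.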

That would be <span class="ca-0">some more text I want to select </span>

I have tried all of these combinations:

//xhtml:td[@class="td-3 c13"]/xhtml:span[@class = "ca-0"]

//xhtml:span[@class = "ca-0"] //ancestor::xhtml:td[@class= "td-3 c13"]

//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"]

Answer 1

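I'm not sure how closely your sample XML reflects your actual XML, but going strictly by the sample (and ignoring the namespace issues you may run into), the following XPath expression: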

//td[contains(@class,"td-3")]/p[1]/span/text()

selects

text I would like to select
some more text I want to select
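
As a quick way to check that result, here is a minimal sketch using Python's standard xml.etree.ElementTree. That module implements only a subset of XPath and has no contains(), so the substring test on @class is done in Python instead. The <tr> wrapper and the shortened sample are assumptions added so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample from the question; the <tr> wrapper is an
# assumption added so the fragment has a single root element.
SAMPLE = """<tr>
<td class="td-3 c12" valign="top">
 <p class="pa-4">
  <span class="ca-5">text I would like to select </span>
 </p>
</td>
<td class="td-3 c13" valign="top">
 <p class="pa-2">
  <span class="ca-0">some more text I want to select </span>
 </p>
 <p class="pa-2">
  <span class="ca-5">text and values I don't want to select.</span>
 </p>
</td>
</tr>"""

root = ET.fromstring(SAMPLE)

selected = []
for td in root.iter("td"):
    # Mirrors contains(@class, "td-3"): a plain substring test.
    if "td-3" in td.get("class", ""):
        # p[1]/span: every span child of the first p child of this td.
        for span in td.findall("p[1]/span"):
            selected.append(span.text.strip())

print(selected)
```

Note that a substring test would also match a class such as td-30, so it is only as precise as contains() itself.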

Answer 2

According to the docs, to support namespaces you should write it like this (note the fn: prefix):

//*:td[fn:contains(@class,"td-3")]/*:p[1]/*:span

Or use a namespace binding:

node.xpath("//xhtml:td[fn:contains(@class,'td-3')]/xhtml:p[1]/xhtml:span", {"xhtml":"http://example.com/ns"})

This expression should also work (it selects the first span of the first p of each td element):

//*:td/*:p[1]/*:span[1]

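Side note:

Your own XPath expressions can be fixed. The span is not a child of the td but a descendant, so we use //. We wrap the expression in () and append [1] to keep only the first result.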

(//xhtml:td[@class="td-3 c13"]//xhtml:span[@class = "ca-0"])[1]
(//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"])[1]

For your second attempt, replace the // before ancestor:: with a predicate []:

(//xhtml:span[@class = "ca-0"][ancestor::xhtml:td[@class= "td-3 c13"]])[1]

Test your XPath with: https://docs.marklogic.com/cts.validIndexPath
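
The child-versus-descendant point can be reproduced outside MarkLogic. The following is a rough sketch with Python's standard xml.etree.ElementTree, assuming the documents live in the standard XHTML namespace http://www.w3.org/1999/xhtml (the question does not show the namespace declaration) and adding a <tr> wrapper as the root. ElementTree does not support the (...)[1] syntax, so the first result is taken with ordinary list indexing.

```python
import xml.etree.ElementTree as ET

# Assumption: the real documents declare the standard XHTML namespace.
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

SAMPLE = """<tr xmlns="http://www.w3.org/1999/xhtml">
<td class="td-3 c13" valign="top">
 <p class="pa-2">
  <span class="ca-0">some more text I want to select </span>
 </p>
 <p class="pa-2">
  <span class="ca-0"><br></br></span>
 </p>
 <p class="pa-2">
  <span class="ca-5">text and values I don't want to select.</span>
 </p>
</td>
</tr>"""

root = ET.fromstring(SAMPLE)

# Child step (/): finds nothing, because each span sits inside a p.
child_hits = root.findall(
    ".//xhtml:td[@class='td-3 c13']/xhtml:span[@class='ca-0']", NS
)

# Descendant step (//): finds both ca-0 spans below the td.
descendant_hits = root.findall(
    ".//xhtml:td[@class='td-3 c13']//xhtml:span[@class='ca-0']", NS
)

# Equivalent of wrapping the expression in (...)[1]: keep the first hit.
first_text = descendant_hits[0].text.strip()

print(len(child_hits), len(descendant_hits), first_text)
```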

Answer 3

The solution is //td[(@class = "td-3") and (@class = "c13")]/p/span

For some reason, it sees

<td class="td-3 c13">

as separate classes, e.g.

<td class = "td-3" and class = "c13">

so you need to treat them that way.
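
In standard XPath 1.0, @class = "td-3" compares the whole attribute value, so whether that expression matches may depend on the tool. A tool-independent way to express the same idea is to treat the class attribute as a space-separated token list. Below is a minimal sketch in Python's standard xml.etree.ElementTree; the helper name has_classes, the <tr> wrapper, and the shortened sample are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <tr> wrapper is an assumption so it parses as one document.
SAMPLE = """<tr>
<td class="td-3 c12"><p><span class="ca-5">text I would like to select </span></p></td>
<td class="td-3 c13"><p><span class="ca-0">some more text I want to select </span></p></td>
</tr>"""


def has_classes(elem, *tokens):
    """Return True if every token appears in the element's class list."""
    present = set(elem.get("class", "").split())
    return all(token in present for token in tokens)


root = ET.fromstring(SAMPLE)

# Keep spans under p only for td cells carrying both class tokens.
matches = [
    span.text.strip()
    for td in root.iter("td")
    if has_classes(td, "td-3", "c13")
    for span in td.findall("p/span")
]

print(matches)
```

Because the check works on whole tokens, it does not depend on the order of the classes and does not confuse td-3 with a longer name such as td-30.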

Thanks to @E.Wiest and @JackFleeting for verifying and pointing me in the right direction.
