How to parse HTML/XML tags according to NOT conditions in [r]

Question

Dearest Stack Overflow folks,

I'm playing with HTML output by EverNote and need to parse the following:

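Note title
Note anchor (the hyperlink identifier of the note itself)
Note creation date
Note content
Intra-notebook hyperlinks (links in a note's content that point to another note's anchor)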

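Following examples by Duncan Temple Lang, author of the [r] XML package, and an SO answer by @jdharrison, I have been able to parse the note title, note anchor, and note creation date with relative ease. For those who may be interested, the commands to do so are: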

library(XML)
# Read the exported notebook into one string (yes... this is noob code)
rawHTML <- paste(readLines("EverNotebook.html"), collapse = "\n")
# asText = TRUE makes it explicit that rawHTML is markup, not a file name
doc <- htmlTreeParse(rawHTML, asText = TRUE, useInternalNodes = TRUE)
# Get note titles
html.titles <- xpathApply(doc, "//h1", xmlValue)
# Get note title anchors
html.tAnchors <- xpathApply(doc, "//a[@name]", xmlGetAttr, "name")
# Get note creation dates
html.Dates <- xpathApply(doc, "//table[@bgcolor]/tr/td/i", xmlValue)

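Here is a fiddle with a sample EverNote HTML export.

I'm stuck on parsing 1. the note content and 2. the intra-notebook hyperlinks.

Looking closer at the code, it's apparent that the solution to the first part is to return every upper-most* div that does NOT include a table with attribute bgcolor="#D4DDE5". How is this accomplished?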

Duncan says that it's possible to use XPath to parse XML according to NOT conditions:

"It allows us to express things such as "find me all nodes named a" or "find me all nodes named a that have no attribute named b" or "nodes a that have an attribute b equal to 'bob'" or "find me all nodes a which have c as an ancestor node""

However, he doesn't go on to describe how the XML package can parse exclusions... so that's where I'm stuck.

Addressing the second part, consider the format of anchors to other notes in the same notebook:

<a href="#13178">

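The goal with these is to get a count of them, but this is difficult because they are distinguished from www links only by the # prefix. There is little information on how to parse these specific anchors by partially matching their value (in this case, #); it may even require grep(). How can the XML package be used to parse these special hrefs? I describe both problems here because a solution to the first part might help with the second... but perhaps I'm wrong. Any advice?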

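Update 1

By upper-most div I meant to say outer-most div. The content of every note in an EverNote HTML export sits inside an outer-most div of the DOM. The aim is therefore to return every outer-most div that does NOT include a table with attribute bgcolor="#D4DDE5".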

Answer 1

"....to return every upper-most div that does NOT include a table with attribute bgcolor="#D4DDE5." How is this accomplished?"

One possible way, ignoring 'upper-most' since I don't know exactly how you would define it:

//div[not(table[@bgcolor='#D4DDE5'])]

The XPath above reads: select every <div> that has no child <table> element whose bgcolor attribute equals #D4DDE5.
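For what it's worth, here is a minimal sketch of how that expression could be run through the XML package. The inline HTML is a made-up stand-in for the real export (I'm assuming the metadata table is a direct child of its div), and the variable names are mine:

```r
library(XML)

# Hypothetical miniature of an export: one div holds the metadata table,
# the other holds note content.
html <- '<html><body>
<div><table bgcolor="#D4DDE5"><tr><td><i>2014-01-01</i></td></tr></table></div>
<div><p>Body of the note</p></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Keep only the divs that have no child table with that bgcolor
content.divs <- getNodeSet(doc, "//div[not(table[@bgcolor='#D4DDE5'])]")

# Extract the text of each remaining div
content.text <- sapply(content.divs, xmlValue)
```

Note that table[...] only looks at direct children of the div; if the table can be nested deeper, .//table[...] would be the thing to try instead.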

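I'm not sure what "parse" means in the second part of the question. If you just want to get all the links with that special href, you can use starts-with() or contains() to partially match the href attribute: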

//a[starts-with(@href, '#')]

//a[contains(@href, '#')]
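As a rough illustration of counting those links from R, here is a small sketch. The sample anchors and the example.com URLs are invented for the demonstration, not taken from the real export:

```r
library(XML)

# Hypothetical anchors: one intra-notebook link, two external links
html <- '<html><body>
<a name="13178"></a>
<a href="#13178">another note</a>
<a href="http://www.example.com/">external site</a>
<a href="http://www.example.com/page#top">external with fragment</a>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# hrefs that begin with '#', i.e. links to anchors in the same document
intra.hrefs <- xpathSApply(doc, "//a[starts-with(@href, '#')]",
                           xmlGetAttr, "href")
n.intra <- length(intra.hrefs)

# contains() also matches external URLs that merely include a '#'
n.contains <- length(getNodeSet(doc, "//a[contains(@href, '#')]"))
```

In this sample starts-with() matches only the intra-notebook link, while contains() also picks up the external URL with a fragment, so starts-with() is probably the safer choice for a count.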

Update:

Taking the "outer-most" div into account:

//div[not(table[@bgcolor='#D4DDE5']) and not(ancestor::div)]

Side note: ~~I don't know exactly how XPath not() is defined, but if it works like negation in general,~~ (the OP confirmed this in the comments below) you can apply one of De Morgan's laws:

"not (A or B)" is the same as "(not A) and (not B)".

so that the updated XPath can be simplified slightly to:

//div[not(table[@bgcolor='#D4DDE5'] or ancestor::div)]
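A quick way to check that the two forms select the same nodes is to run both against a small document. This is only a sketch: the nested markup and the id attributes are hypothetical, added so the selected divs can be told apart.

```r
library(XML)

# Hypothetical structure: an outer note div with a nested div,
# plus a div holding the metadata table.
html <- '<html><body>
<div id="note1"><div id="inner">nested text</div></div>
<div id="meta"><table bgcolor="#D4DDE5"><tr><td>header</td></tr></table></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

xp.and <- "//div[not(table[@bgcolor='#D4DDE5']) and not(ancestor::div)]"
xp.or  <- "//div[not(table[@bgcolor='#D4DDE5'] or ancestor::div)]"

# Collect the ids of the divs selected by each expression
ids.and <- xpathSApply(doc, xp.and, xmlGetAttr, "id")
ids.or  <- xpathSApply(doc, xp.or, xmlGetAttr, "id")
```

Both expressions should leave only the outer note div: the nested div is dropped because it has a div ancestor, and the metadata div is dropped because of its table.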
