Scrape all child paragraphs under heading (preferably rvest)

Question

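My objective is to use the library(tm) toolkit on a rather large Word document. The Word document has sensible typography, so we have h1 for the main sections, plus some h2 and h3 subheadings. I want to compare and text-mine each section, meaning the text below each h1. The subheadings are of little importance, so they can be included or excluded.

My strategy is to export the Word document to HTML and then use the rvest package to extract the paragraphs.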

library(rvest)
# the file has latin-1 chars
#Sys.setlocale(category="LC_ALL", locale="da_DK.UTF-8")
# small example html file
# rvest::html() was deprecated in later rvest releases; read_html() is its replacement
file <- rvest::read_html("https://83ae1009d5b31624828197160f04b932625a6af5.googledrive.com/host/0B9YtZi1ZH4VlaVVCTGlwV3ZqcWM/tidy.html", encoding = 'utf-8')

nodes <- file %>%
  rvest::html_nodes("h1>p") %>%
  rvest::html_text()

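I can extract all <p> elements with html_nodes("p"), but that is just one big soup. I need to analyze each h1 separately.

Ideally the result would be a list with one vector of p tags per h1 heading. Perhaps a loop along the lines of for (i in 1:length(html_nodes(fil, "h1"))) (html_children(html_nodes(fil, "h1")[i])) (which does not work).

Bonus points if there is a way to tidy Word HTML from within rvest.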

Answer 1

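Note that > is the child combinator. The selector you currently have looks for p elements that are children of an h1, which does not make sense in HTML and so returns nothing.

If you inspect the generated markup, at least in the example document you have provided, you will notice that every h1 element (as well as the heading for the table of contents, which is marked up as a p instead) has an associated parent div: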
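To make the difference concrete, here is a minimal sketch on a made-up inline document (not your file), assuming a current rvest where read_html() accepts literal HTML. The paragraphs sit next to the h1 as siblings, so "h1 > p" matches nothing, while a selector that goes through the shared parent does.

```r
library(rvest)

# Hypothetical miniature document: the paragraphs follow the h1 as siblings.
doc <- read_html("<body><div><h1>Title</h1><p>First</p><p>Second</p></div></body>")

# "h1 > p" asks for p elements that are children of an h1; there are none.
length(html_nodes(doc, "h1 > p"))
# [1] 0

# "div > p" asks for p elements that are children of the div.
html_text(html_nodes(doc, "div > p"))
# [1] "First"  "Second"
```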

<body lang="EN-US">
  <div class="WordSection1">
    <p class="MsoTocHeading"><span lang="DA" class='c1'>Indholdsfortegnelse</span></p>
    ...
  </div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
  <div class="WordSection2">
    <h1><a name="_Toc285441761"><span lang="DA">Interview med Jakob skoleleder på a_skolen</span></a></h1>
    ...
  </div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
  <div class="WordSection3">
    <h1><a name="_Toc285441762"><span lang="DA">Interviewet med Andreas skoleleder på b_skolen</span></a></h1>
    ...
  </div>
</body>

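All of the p elements in each section denoted by an h1 can be found in its respective parent div. With this in mind, you could simply select the p elements that are siblings of each h1. However, since rvest doesn't currently have a way to select siblings from a context node (html_nodes() only supports looking at a node's subtree, i.e. its descendants), you will need to do this another way.

Assuming HTML Tidy creates a structure where every h1 is in a div that sits directly within body, you can grab every div except the table of contents using the following selector: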

sections <- html_nodes(file, "body > div ~ div")

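In your example document, this should result in div.WordSection2 and div.WordSection3. The table of contents is represented by div.WordSection1, which the selector excludes because it has no preceding sibling div.

Then extract the paragraphs from each div: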
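If you want to check which div elements the selector picks up without touching the real file, a small sketch like the following may help. The inline document is a hypothetical stand-in that only mimics the three-section layout shown above, and the heading texts in it are invented.

```r
library(rvest)

# Hypothetical stand-in for the exported Word HTML (same layout, made-up text).
doc <- read_html(paste0(
  "<body>",
  "<div class='WordSection1'><p>Indholdsfortegnelse</p></div>",
  "<span><br></span>",
  "<div class='WordSection2'><h1>First interview</h1><p>Text</p></div>",
  "<span><br></span>",
  "<div class='WordSection3'><h1>Second interview</h1><p>Text</p></div>",
  "</body>"
))

# Every div directly under body that has an earlier sibling div.
sections <- html_nodes(doc, "body > div ~ div")
html_attr(sections, "class")
# [1] "WordSection2" "WordSection3"
```

The first div is skipped only because nothing div-shaped precedes it, so this relies on the table of contents always coming first.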

for (section in sections) {
  paras <- html_nodes(section, "p")
  # Do stuff with paragraphs in each section...
  print(length(paras))
}
# [1] 9
# [1] 8

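As you can see, length(paras) corresponds to the number of p elements in each div. Note that some of them contain only an &nbsp; which may be troublesome depending on your needs. I'll leave dealing with those outliers as an exercise to the reader.

Unfortunately, no bonus points for me, as rvest does not provide its own HTML Tidy functionality. You will need to process your Word documents separately.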
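For the list the question asks for (one vector of paragraphs per h1), the loop could be turned into something like the sketch below. It runs on a hypothetical inline document rather than your file, the name paras_by_section is made up for illustration, and treating a paragraph as empty when it holds only whitespace or a non-breaking space is my assumption about what you would want to drop.

```r
library(rvest)

# Hypothetical stand-in for the exported Word HTML.
doc <- read_html(paste0(
  "<body>",
  "<div class='WordSection1'><p>Contents</p></div>",
  "<div class='WordSection2'><h1>Interview A</h1>",
  "<p>One</p><p>&nbsp;</p><p>Two</p></div>",
  "<div class='WordSection3'><h1>Interview B</h1><p>Three</p></div>",
  "</body>"
))

sections <- html_nodes(doc, "body > div ~ div")

# One character vector of paragraph text per section.
paras_by_section <- lapply(sections, function(section) {
  txt <- html_text(html_nodes(section, "p"))
  # Drop paragraphs holding only whitespace or a non-breaking space (U+00A0).
  txt[nzchar(gsub("[[:space:]\u00a0]", "", txt))]
})

# Name each list element after the first h1 found in its section.
names(paras_by_section) <- vapply(
  sections,
  function(section) html_text(html_nodes(section, "h1"))[1],
  character(1)
)

str(paras_by_section)
```

Each element could then be collapsed with paste() and handed to tm, for example through VectorSource(), though I have not tried that step on your data.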
