Parse html-comments with xml2
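I have started experimenting with the xml2 package to parse some R Markdown files. I am now very interested in parsing html-comments in a structured way, as well as the information between sections (e.g. #### etc.).
My current attempt to access the content of the comments can be found below.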
library(xml2)
library(magrittr)
# html-output as created by rmarkdown
x <- xml2::read_xml("
<div id='header-level-1' class='section level1'>
<h1>Header Level 1</h1>
<!-- This is a comment, which I want to parse -->
<div id='header-level-4-1' class='section level4'>
<h4>Header Level 4 (1)</h4>
<!-- parse me 4 (1) -->
<p>Hello!</p>
</div>
<div id='header-level-4-2' class='section level4'>
<h4>Header Level 4 (2)</h4>
<!-- parse me 4 (2) -->
<p>How are you?</p>
<pre class='r'><code>print(&quot;Hello World&quot;)</code></pre>
</div>
</div>
")
# inspecting the structure, {comments} are present as a structural element
x %>%
  html_structure()
#> <div#header-level-1 .section.level1>
#>   <h1>
#>     {text}
#>   {comment}
#>   <div#header-level-4-1 .section.level4>
#>     <h4>
#>       {text}
#>     {comment}
#>     <p>
#>       {text}
#>   <div#header-level-4-2 .section.level4>
#>     <h4>
#>       {text}
#>     {comment}
#>     <p>
#>       {text}
#>     <pre.r>
#>       <code>
#>         {text}
# first attempt to access the content of the comments
x %>%
  xml_find_all("//div") %>%
  sub("^.*<!-- ", "", .) %>%
  sub(" -->.*$", "", .)
#> [1] "parse me 4 (2)" "parse me 4 (1)" "parse me 4 (2)"
I'm sure there is a better way? Ideally, I would extract the comments and keep the hierarchy (e.g. which header each comment belongs to).
xml_find_all(x, ".//*/comment()/../div")
## {xml_nodeset (2)}
## [1] <div id="header-level-4-1" class="section level4">\n <h4>Header Level 4 (1)</h4>\n <!-- parse me 4 (1) -->\n <p>He ...
## [2] <div id="header-level-4-2" class="section level4">\n <h4>Header Level 4 (2)</h4>\n <!-- parse me 4 (2) -->\n <p>Ho ...
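If the goal is the comment text itself together with the header it belongs to, one possible approach is to select the comment nodes directly with the XPath node test `comment()` and then walk up to each comment's parent section. The following is only a sketch under a few assumptions: the sample document is a shortened copy of the one in the question, the helper `describe_comment()` and the column names are made up for illustration, and every comment is assumed to sit directly inside a `div` that also contains its heading, as in the rmarkdown output shown.

```r
library(xml2)

# Shortened copy of the sample document from the question
x <- read_xml("
<div id='header-level-1' class='section level1'>
<h1>Header Level 1</h1>
<!-- This is a comment, which I want to parse -->
<div id='header-level-4-1' class='section level4'>
<h4>Header Level 4 (1)</h4>
<!-- parse me 4 (1) -->
<p>Hello!</p>
</div>
<div id='header-level-4-2' class='section level4'>
<h4>Header Level 4 (2)</h4>
<!-- parse me 4 (2) -->
<p>How are you?</p>
</div>
</div>
")

# Select the comment nodes themselves instead of applying a regex
# to the serialized <div> elements
comments <- xml_find_all(x, "//comment()")

# Hypothetical helper: describe one comment by its enclosing section
describe_comment <- function(node) {
  parent <- xml_parent(node)
  # The heading that is a direct child of the same section, if any
  heading <- xml_find_first(parent, "./h1 | ./h2 | ./h3 | ./h4 | ./h5 | ./h6")
  data.frame(
    section = xml_attr(parent, "id"),
    heading = xml_text(heading),
    comment = trimws(xml_text(node)),
    stringsAsFactors = FALSE
  )
}

# One row per comment, in document order
result <- do.call(rbind, lapply(comments, describe_comment))
result
```

For the sample document this should give three rows, one per comment, each paired with the `id` and heading text of the section that directly contains it. Processing the comments one at a time keeps a one-to-one mapping even when several comments share the same parent section.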