Extract more than one type of element whilst preserving order using rvest (or similar) in R?
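I am trying to extract elements of two different types from an HTML document while preserving their order.
Extracting each element type on its own is simple (see the example), but I can't work out how to extract both at once and keep them in the order in which they appear on the page.
Minimal example
Here is some dummy HTML: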
dummy_html <- "<p>hi there</p>
<p>2nd para</p>
<div>unwanted stuff</div>
<span>something new</span>
<p>3rd para</p>
<span>extra stuff</span>
<div>more unwanted stuff</div>
<p>4th para</p>"
Suppose we want to extract all p elements and all span elements (keeping them in the order in which they appear):
# p elements on their own
library(rvest)
dummy_html %>% read_html %>% html_nodes("p")
{xml_nodeset (4)}
[1] <p>hi there</p>
[2] <p>2nd para</p>
[3] <p>3rd para</p>
[4] <p>4th para</p>
# span elements on their own
dummy_html %>% read_html %>% html_nodes("span")
{xml_nodeset (2)}
[1] <span>something new</span>
[2] <span>extra stuff</span>
But how can we extract all elements of either type? That is, all p elements and all span elements together, so that the desired output is:
{xml_nodeset (6)}
[1] <p>hi there</p>
[2] <p>2nd para</p>
[3] <span>something new</span>
[4] <p>3rd para</p>
[5] <span>extra stuff</span>
[6] <p>4th para</p>
Note that the order is preserved (i.e. the p and span elements are interleaved).
What I've tried so far
I tried the obvious dummy_html %>% read_html %>% html_nodes("span|p"), but it threw an error.
You can do this with either CSS or XPath syntax; your CSS selector just needs , instead of |:
library(rvest)
#> Loading required package: xml2
dummy_html <- "<p>hi there</p>
<p>2nd para</p>
<div>unwanted stuff</div>
<span>something new</span>
<p>3rd para</p>
<span>extra stuff</span>
<div>more unwanted stuff</div>
<p>4th para</p>"
# With CSS
dummy_html %>% read_html() %>% html_nodes("p,span")
#> {xml_nodeset (6)}
#> [1] <p>hi there</p>
#> [2] <p>2nd para</p>
#> [3] <span>something new</span>
#> [4] <p>3rd para</p>
#> [5] <span>extra stuff</span>
#> [6] <p>4th para</p>
# With XPath
dummy_html %>% read_html() %>% html_nodes(xpath = "//span | //p")
#> {xml_nodeset (6)}
#> [1] <p>hi there</p>
#> [2] <p>2nd para</p>
#> [3] <span>something new</span>
#> [4] <p>3rd para</p>
#> [5] <span>extra stuff</span>
#> [6] <p>4th para</p>
Created on 2019-10-19 by the reprex package (v0.3.0)
Thanks to QHarr for pointing out the (more concise) CSS option!
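Once both element types sit in a single nodeset, you may want to know which tag each node came from. Here is a minimal sketch of one way to do that, assuming the same dummy_html as above; the names nodes and result are purely illustrative, and html_name() and html_text() are the rvest helpers for reading a node's tag name and text.

```r
library(rvest)

dummy_html <- "<p>hi there</p>
<p>2nd para</p>
<div>unwanted stuff</div>
<span>something new</span>
<p>3rd para</p>
<span>extra stuff</span>
<div>more unwanted stuff</div>
<p>4th para</p>"

# Select both element types in one pass with a comma-separated CSS selector
nodes <- dummy_html %>% read_html() %>% html_nodes("p, span")

# Pair each node's tag name with its text, one row per node,
# in the same order as the nodeset
result <- data.frame(
  tag  = html_name(nodes),
  text = html_text(nodes),
  stringsAsFactors = FALSE
)

result
```

Because the rows follow the nodeset, the tag column lets you filter or split by element type later without losing the original ordering.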