R and xml2: how to read text that is not in child nodes, and read information even if a node is missing
I am using R and its package xml2 to parse an HTML document. I extracted a piece of the HTML file, and it looks like this:
text <- ('<div>
<p><span class="number">1</span>First <span class="small-accent">previous</span></p>
<p><span class="number">2</span>Second <span class="accent">current</span></p>
<p><span class="number">3</span>Third </p>
<p><span class="number">4</span>Fourth <span class="small-accent">last</span> A</p>
</div>')
My goal is to extract information from the text and convert it into a data frame like this:
  number    label text_of_accent type_of_accent
1      1    First       previous   small-accent
2      2   Second        current         accent
3      3    Third
4      4 Fourth A           last   small-accent
I tried the following code:
library(xml2)
library(magrittr)
html_1 <- text %>%
  read_html() %>%
  xml_find_all("//span[@class='number']")
number <- html_1 %>% xml_text()
label <- html_1 %>%
  xml_parent() %>%
  xml_text(trim = TRUE)
text_of_accent <- html_1 %>%
  xml_siblings() %>%
  xml_text()
type_of_accent <- html_1 %>%
  xml_siblings() %>%
  xml_attr("class")
Unfortunately, label, text_of_accent, and type_of_accent are not extracted as I expected:
label
[1] "1First previous" "2Second current" "3Third" "4Fourth last A"
text_of_accent
[1] "previous" "current" "last"
type_of_accent
[1] "small-accent" "accent" "small-accent"
Is it possible to achieve my goal with xml2 alone, or do I need some additional tools? Is it at least possible to extract the text fragment for label?
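This can be done with xml2. Your label came out wrong because xml_text() returns all the text under a node, that is, the text of the current node and of all its children. To avoid this, first use the XPath text() to select only the text nodes that are direct children of the current node, then extract their text. You also need to check whether the optional nodes exist and handle the missing case properly: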
library(xml2)
library(magrittr)

# read in text as html and extract all p nodes as a list
lst <- read_html(text) %>% xml_find_all("//p")
lapply(lst, function(node) {
  # find the first span (the number)
  first_span_node <- xml_find_first(node, "./span[@class='number']")
  number <- xml_text(first_span_node, trim = TRUE)
  # use text() to select only the text nodes directly under the current node;
  # trim each piece so the joined label has single spaces
  label <- paste(xml_text(xml_find_all(node, "./text()"), trim = TRUE), collapse = " ")
  # find the second span (the accent)
  accent_node <- xml_find_first(first_span_node, "./following-sibling::span")
  # check if the second span exists
  if (length(accent_node) != 0) {
    text_of_accent <- xml_text(xml_find_first(accent_node, "./text()"))
    type_of_accent <- xml_text(xml_find_first(accent_node, "./@class"))
  } else {
    text_of_accent <- ""
    type_of_accent <- ""
  }
  c(number = number, label = label,
    text_of_accent = text_of_accent,
    type_of_accent = type_of_accent)
}) %>%
  do.call(rbind, .) %>% as.data.frame()
#   number    label text_of_accent type_of_accent
# 1      1    First       previous   small-accent
# 2      2   Second        current         accent
# 3      3    Third
# 4      4 Fourth A           last   small-accent
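As a possible alternative to the explicit existence check, here is a minimal vectorized sketch. It relies on xml_find_first() returning one result per input node when applied to a node set, with a missing node where nothing matches, so xml_text() and xml_attr() yield NA for the paragraph without an accent. The names paragraphs, number_nodes, accent_nodes, and result are only illustrative, and the XPath "./span[not(@class='number')]" assumes the accent is the only other span in each paragraph.

```r
library(xml2)

# Same sample HTML as in the question.
text <- '<div>
<p><span class="number">1</span>First <span class="small-accent">previous</span></p>
<p><span class="number">2</span>Second <span class="accent">current</span></p>
<p><span class="number">3</span>Third </p>
<p><span class="number">4</span>Fourth <span class="small-accent">last</span> A</p>
</div>'

paragraphs <- xml_find_all(read_html(text), "//p")

# One result per paragraph; a missing node where no span matches.
number_nodes <- xml_find_first(paragraphs, "./span[@class='number']")
# Assumption: any span that is not the number span is the accent span.
accent_nodes <- xml_find_first(paragraphs, "./span[not(@class='number')]")

# Direct text children only, so text inside the spans is excluded.
label <- vapply(paragraphs, function(p) {
  pieces <- xml_text(xml_find_all(p, "./text()"), trim = TRUE)
  paste(pieces[nzchar(pieces)], collapse = " ")
}, character(1))

result <- data.frame(
  number = xml_text(number_nodes),
  label = label,
  text_of_accent = xml_text(accent_nodes),          # NA where the span is absent
  type_of_accent = xml_attr(accent_nodes, "class"), # NA where the span is absent
  stringsAsFactors = FALSE
)
print(result)
```

Unlike the lapply() version, this sketch leaves missing accents as NA rather than empty strings; if empty strings are preferred, they can be substituted afterwards, for example with result[is.na(result)] <- "".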