R - rvest (webscraping) with unclosed xml nodes, here: problem with html_nodes("br")
I am using rvest to extract part of a web page (edit: this webpage) with the following code:
library('rvest')
webpage <- read_html(url("https://www.tandfonline.com/action/journalInformation?show=editorialBoard&journalCode=ceas20"))
people <- webpage %>%
html_nodes(xpath='//*[@id="8af55cbd-03a5-4deb-9086-061d8da288d1"]/div/div/div') %>%
html_nodes(xpath='//p')
The result is stored in an xml_nodeset named people:
> people
{xml_nodeset (11)}
[1] <p> <b>Editors:</b> <br> Dr Xyz Anceschi - <i>University of Glasgow <a href="http://www.gla.ac.uk/schools/soci ...
[2] <p> <b>Editorial Board:</b> <br> Dr Xyz Aliyev - <i>University of Glasgow</i> <br> Professor Richard Berry < ...
[3] <p> <b>Board of Management:</b> <br> Professor Xyz Berry (Chair) <i>- University of Glasgow</i> <br> Profes ...
[4] <p> <b>National Advisory Board:</b> <br> Dr Xyz Badcock <i>- University of Nottingham</i> <br> Professor Cath ...
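In people, each element contains the names of several persons, each one following a <br> (which, however, is unclosed: there is no </br>).
I tried to extract each person with this code, but it does not work: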
sapply(people,
function(x)
{
x %>%
html_nodes("br") %>%
html_text()
}
)
It only gives me a list of empty results:
[[1]]
[1] "" ""
[[2]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[[3]]
[1] "" "" "" "" ""
[[4]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
I assume the problem is that <br> is an unclosed node in the xml_nodeset. Is that the case?
If so, is there anything else I can do to extract each person from people?
You can use str_match_all to get all the names that appear between <br> and <i>.
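# Backslashes must be doubled inside an R string; as.character() turns each node into its HTML source.
unlist(sapply(stringr::str_match_all(as.character(people), '<br> (.*?)\\s?-?\\s<i>'),
function(x) x[, 2]))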
# [1] "Dr Luca Anceschi" "Professor David J. Smith"
# [3] "Dr Huseyn Aliyev" "Professor Richard Berry"
# [5] "Dr Maud Bracke" "Dr Eamonn Butler"
# [7] "Dr Ammon Cheskin" "Dr Sai Ding"
# [9] "Professor Jane Duckett" "Professor Rick Fawn"
#...
#...
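As a possible alternative to the regex, here is a minimal sketch that selects the text nodes following each <br> with XPath. It also shows why html_nodes("br") returns empty strings: <br> is a void element with no content of its own, so the names are sibling text nodes of <br>, not its children. The html string and the names in it are made up for illustration and only imitate the structure shown in the question; the real page may differ.

```r
library(rvest)

# Hypothetical miniature of one <p> element from the page (names are invented).
html <- '<p> <b>Editors:</b> <br> Dr A Example - <i>University of Glasgow</i> <br> Professor B Sample <i>- University of Glasgow</i> </p>'
p <- html_nodes(read_html(html), xpath = "//p")[[1]]

# <br> has no children, so its text is always the empty string.
br_text <- html_text(html_nodes(p, xpath = ".//br"))

# The names are text nodes that are direct children of <p> and come after a <br>.
raw <- html_text(html_nodes(p, xpath = "./text()[preceding-sibling::br]"))

# Trim whitespace, drop a trailing "-" separator, and discard blank nodes.
names_found <- trimws(sub("\\s*-\\s*$", "", trimws(raw)))
names_found <- names_found[nzchar(names_found)]

print(br_text)      # "" ""
print(names_found)  # "Dr A Example" "Professor B Sample"
```

Because it reads the text nodes directly, this sketch does not depend on whether the dash sits before or inside the <i> tag. Applied to the real data, the same two XPath steps could be run on each element of people with lapply.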