How to separate html_text results using rvest?

I am trying to scrape information from a Google Scholar page:

https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science

library(rvest)

htmlfile<-"https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science"

g_interest<- read_html(htmlfile) %>% html_nodes("div.gsc_oai_int") %>% html_text()

I get the following result:

 [1] "Quantum Chemistry Electronic Structure Condensed Matter Physics Materials Science Nanotechnology "                   
 [2] "density functional theory first principles calculations many body theory condensed matter physics materials science "
 [3] "chemistry materials science physics nanotechnology "                                                                 
 [4] "Materials Science Nanotechnology Chemistry Physics "                                                                 
 [5] "Physics Theoretical Physics Condensed Matter Theory Materials Science Nanoscience "                                  
 [6] "Materials Science Quantum Chemistry Fiber Optic Sensors Geophysics "                                                 
 [7] "Chemical Physics Condensed Matter Materials Science Magnetic Properties NMR "                                        
 [8] "Materials Science "                                                                                                  
 [9] "Materials Science Physics "                                                                                          
[10] "Physics Materials Science Theoretical Physics Nanoscience "                                                          

However, I would like to get a result like this:

[1]"Quantum Chemistry; Electronic Structure;Condensed Matter Physics; Materials Science; Nanotechnology " 
......

Any suggestions on how to separate the results with ";"?

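You can use the purrr and stringr packages: first extract each parent node (div.gsc_oai_int), then pull out the child nodes (.gsc_oai_one_int) inside each parent and join their text with ";". Calling html_text() on the parent directly flattens all of its children into one string, which is why the boundaries between interests were lost.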

library(rvest)
library(purrr)
library(stringr)

htmlfile<-"https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science"

content_nodes<- read_html(htmlfile) %>% html_nodes("div.gsc_oai_int")

map_chr(content_nodes,~.x %>%
        html_nodes(".gsc_oai_one_int") %>%
        html_text() %>%
        str_c(collapse = ";"))

Result:

[1] "Quantum Chemistry;Electronic Structure;Condensed Matter Physics;Materials Science;Nanotechnology"                   
[2] "density functional theory;first principles calculations;many body theory;condensed matter physics;materials science"
[3] "chemistry;materials science;physics;nanotechnology"                                                                 
[4] "Materials Science;Nanotechnology;Chemistry;Physics"                                                                 
[5] "Physics;Theoretical Physics;Condensed Matter Theory;Materials Science;Nanoscience"                                  
[6] "Materials Science;Quantum Chemistry;Fiber Optic Sensors;Geophysics"                                                 
[7] "Chemical Physics;Condensed Matter;Materials Science;Magnetic Properties;NMR"                                        
[8] "Materials Science"                                                                                                  
[9] "Materials Science;Physics"                                                                                          
[10] "Physics;Materials Science;Theoretical Physics;Nanoscience"
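
If you want to try the idea without hitting Google Scholar (the live page may change or block scraping), here is a minimal sketch on a hand-written HTML snippet. The snippet is an assumption that only mimics the structure used above (one div.gsc_oai_int per author, one .gsc_oai_one_int element per interest); it is not the real page markup. It also uses "; " as the separator, since that is closer to what the question asked for.

```r
library(rvest)
library(purrr)
library(stringr)

# Hypothetical stand-in for the Scholar page: two authors, each with
# one interest per <a class="gsc_oai_one_int"> inside div.gsc_oai_int.
html <- '<html><body>
<div class="gsc_oai_int"><a class="gsc_oai_one_int">Quantum Chemistry</a> <a class="gsc_oai_one_int">Materials Science</a> </div>
<div class="gsc_oai_int"><a class="gsc_oai_one_int">Physics</a> </div>
</body></html>'

page <- read_html(html)

# Taking the text of each parent div flattens its children into one
# string, so the boundaries between interests are lost.
flat <- page %>% html_nodes("div.gsc_oai_int") %>% html_text()

# Extracting the children per parent and collapsing them keeps the
# boundaries; map_chr returns one string per parent div.
separated <- page %>%
  html_nodes("div.gsc_oai_int") %>%
  map_chr(~ .x %>%
            html_nodes(".gsc_oai_one_int") %>%
            html_text() %>%
            str_c(collapse = "; "))

print(separated)
# [1] "Quantum Chemistry; Materials Science" "Physics"
```

Change the collapse argument to ";" to reproduce the exact format shown in the result above.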