Parsing rvest output from an unstructured infobox

Question

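I am trying to use the rvest package in R to extract data from a wiki fan site. However, I have run into several problems because the infobox is not structured as an HTML table. My attempts so far are shown below: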

library(tidyverse)
library(data.table)
library(rvest)
library(httr)

url <- c("https://starwars.fandom.com/wiki/Anakin_Skywalker")

#See here that the infobox information does not appear when checking for HTML tables in the page
df <- read_html(url) %>%
  html_table()

#So now just extract data using the CSS selector
df <- read_html(url) %>%
  html_element("aside") %>%
  html_text2()

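The second attempt does extract the raw data, but as a single block of text that is not easy to clean into a tidy data frame. So I then tried extracting each element of the infobox separately, which might be easier to clean and assemble into a data frame. However, when I try to do this with XPath, I get an empty result: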

df <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/aside/section[1]') %>%
  html_text2()

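So my question is mainly: does anyone know a good way to automatically extract the infobox in a data-frame-friendly format? If not, can someone tell me why my attempt to extract each panel separately does not work?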

Answer 1

If you target div.pi-data directly, you can do this:

# Each div.pi-data holds one label/value pair of the infobox
bind_rows(
  read_html(url) %>%
    html_elements("div.pi-data") %>%
    map(.f = ~ tibble(
      label = html_elements(.x, ".pi-data-label") %>% html_text2(),
      # Split multi-line values so each line becomes its own row
      text = html_elements(.x, ".pi-data-value") %>% html_text2() %>% strsplit(split = "\n")
    ) %>% unnest(text))
)

Output:

# A tibble: 29 x 2
   label      text                                                              
   <chr>      <chr>                                                             
 1 Homeworld  Tatooine[1]                                                       
 2 Born       41 BBY,[2] Tatooine[3]                                            
 3 Died       4 ABY,[4]DS-2 Death Star II Mobile Battle Station, Endor system[5]
 4 Species    Human[1]                                                          
 5 Gender     Male[1]                                                           
 6 Height     1.88 meters,[1] later 2.03 meters (6 ft, 8 in) in armor[6]        
 7 Mass       120 kilograms in armor[7]                                         
 8 Hair color Blond,[8] light[9] and dark[10]                                   
 9 Eye color  Blue,[11] later yellow (dark side)[12]                            
10 Skin color Light,[11] later pale[5]                                          
# ... with 19 more rows
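
If you want to try the same idea without hitting the network, here is a minimal sketch that runs against a hand-written HTML fragment. The fragment is made up for illustration: it only imitates the class names used above (pi-data, pi-data-label, pi-data-value) and is not copied from the real page. It also uses html_element() (singular) rather than html_elements(), which is a small variation on the code above: a block that lacks a label or value should then give NA instead of being dropped.

```r
library(rvest)
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical fragment imitating the infobox markup; not the real page.
page <- minimal_html('
  <aside class="portable-infobox">
    <div class="pi-item pi-data">
      <h3 class="pi-data-label">Homeworld</h3>
      <div class="pi-data-value">Tatooine</div>
    </div>
    <div class="pi-item pi-data">
      <h3 class="pi-data-label">Born</h3>
      <div class="pi-data-value">41 BBY<br>Tatooine</div>
    </div>
  </aside>
')

# One small tibble per div.pi-data block, then stack them
infobox <- page %>%
  html_elements("div.pi-data") %>%
  map(~ tibble(
    label = html_text2(html_element(.x, ".pi-data-label")),
    # <br> becomes "\n" in html_text2(), so split on it to get one row per line
    text  = strsplit(html_text2(html_element(.x, ".pi-data-value")), "\n")
  ) %>% unnest(text)) %>%
  bind_rows()

print(infobox)
```

The result is a long table with one row per line of each value, so a label such as Born appears once for every line it had in the infobox.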
