Scraping name(values) from attributes in rvest R

Question

I want to scrape the following web page (scraping is permitted):

https://www.bisafans.de/pokedex/listen/numerisch.php

The goal is to extract a table like this:

number	name	type1	type2
001	Bisasam	Pflanze	Gift
002	...	...	...

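I was able to scrape the number and name columns of the table, but I cannot extract the types, because they only appear as the alt text of images:

<img src="https://media.bisafans.de/f630aa6/typen/pflanze.png" alt="Pflanze">

How can I extract the name that follows alt? I have tried extracting the whole table, but that only yields the numbers and names. I also tried html_attr(), but that did not work either.

Does anyone know how I can do this?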

Answer 1

First, read the HTML:

library(rvest)

res <- read_html('https://www.bisafans.de/pokedex/listen/numerisch.php')

Now extract the table:

tab <- res %>% html_table() %>% `[[`(1)

Remove the ??? entries at the bottom of the table, which have no Typen images:

tab <- tab[tab[[2]] != '???', ]

Use XPath to select the first image in each Typen cell, extract the alt attribute of those nodes, and store the result in the Typen column of tab:

tab$Typen <- res %>% html_nodes(xpath = "//td/a[1]/img") %>% html_attr('alt')

This gives you:

tab
#> # A tibble: 908 x 3
#>      Nr. Pokémon   Typen  
#>    <int> <chr>     <chr>  
#>  1     1 Bisasam   Pflanze
#>  2     2 Bisaknosp Pflanze
#>  3     3 Bisaflor  Pflanze
#>  4     4 Glumanda  Feuer  
#>  5     5 Glutexo   Feuer  
#>  6     6 Glurak    Feuer  
#>  7     7 Schiggy   Wasser 
#>  8     8 Schillok  Wasser 
#>  9     9 Turtok    Wasser 
#> 10    10 Raupy     Kaefer 
#> # ... with 898 more rows
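
The XPath step can be tried offline on a small inline document. The following is only a sketch: the two-row table is a made-up stand-in for the real page (the file names and markup are assumptions), built with rvest's minimal_html(). It also shows why the same pattern does not extend directly to the second type: rows without a second image simply produce no node, so the result is shorter than the table.

```r
library(rvest)

# Hypothetical two-row table imitating the structure of the real page
page <- minimal_html('
  <table>
    <tr><td>001</td><td><a href="#">Bisasam</a></td>
        <td><a href="#"><img src="pflanze.png" alt="Pflanze"></a><a href="#"><img src="gift.png" alt="Gift"></a></td></tr>
    <tr><td>004</td><td><a href="#">Glumanda</a></td>
        <td><a href="#"><img src="feuer.png" alt="Feuer"></a></td></tr>
  </table>')

# First image in each cell: one value per row
first_type <- page %>% html_nodes(xpath = "//td/a[1]/img") %>% html_attr("alt")

# Second image: rows without one contribute nothing, so lengths differ
second_type <- page %>% html_nodes(xpath = "//td/a[2]/img") %>% html_attr("alt")

length(first_type)   # 2
length(second_type)  # 1
```

Because second_type has fewer elements than the table has rows, it cannot be assigned as a column without first working row by row.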

Answer 2

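After many attempts, I was able to get both Typen for a given Pokémon.

First we write a function that builds the XPath for each Pokémon's table row and extracts the necessary information.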

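library(rvest)
library(tidyverse)  # str_subset(), group_by(), pivot_wider()

res <- read_html('https://www.bisafans.de/pokedex/listen/numerisch.php')

f1 = function(n){
  # XPath of the n-th table row
  xx = paste0('//*[@id="content"]/section/div/table/tbody/tr[', n, ']')

  # link text in the row, dropping the empty strings from the image links
  Pokémon = res %>% html_nodes(xpath = xx) %>% html_nodes('a') %>% html_text() %>% str_subset(".+")

  # alt text of every type image in the row
  type = res %>% html_nodes(xpath = xx) %>% html_nodes('a') %>%
    html_nodes('img') %>% html_attr('alt')

  dat = data.frame(Pokémon, type)
  return(dat)
}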

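Then we use lapply to iterate over all the row XPaths and collect the results in a list. Because the ??? entries raise an error, we wrap the call in tryCatch to skip them.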

df = lapply(1:912, function(x){ 
  tryCatch(f1(x), error=function(e) NA)
  }
)
#convert to dataframe
df = do.call(rbind.data.frame, df)

Finally, to get the desired output, we use pivot_wider:

df %>% group_by(Pokémon) %>% 
  mutate(n = row_number()) %>% 
  pivot_wider(
    names_from = "n", 
    names_prefix = "type_", 
    values_from = "type") %>% select_if(function(x) !(all(is.na(x)) | all(x=="")))

# A tibble: 909 x 3
# Groups:   Pokémon [909]
   Pokémon   type_1  type_2
   <chr>     <chr>   <chr> 
 1 Bisasam   Pflanze Gift  
 2 Bisaknosp Pflanze Gift  
 3 Bisaflor  Pflanze Gift  
 4 Glumanda  Feuer   NA    
 5 Glutexo   Feuer   NA
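
The reshaping step can be checked in isolation on a tiny hand-made data frame. This is a minimal sketch, not the scraped data: the column names name and type and the three rows are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per Pokémon/type pair
long <- data.frame(
  name = c("Bisasam", "Bisasam", "Glumanda"),
  type = c("Pflanze", "Gift", "Feuer")
)

wide <- long %>%
  group_by(name) %>%
  mutate(n = row_number()) %>%   # number the types within each Pokémon
  ungroup() %>%
  pivot_wider(names_from = "n", names_prefix = "type_", values_from = "type")

wide
```

Pokémon with a single type get NA in type_2, which matches the output shown above.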

Answer 3

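Here is an alternative. It is not trivial, because rvest extracts only text by default, and that behaviour is hard-coded into the function. But since we know exactly what the table should look like, we can iterate over the row XML nodes and put each item into a column: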

library(rvest)
library(tidyverse)
# read html
html <- read_html("https://www.bisafans.de/pokedex/listen/numerisch.php")

html %>% 
  # select tr nodes aka rows
  html_nodes(".table tr") %>% 
  # map_df applies the function to each row and binds the results into
  # one data frame
  map_df(function(x) {

    # first extract text
    text <- html_text(x, trim = TRUE)
    # this comes out as one string so let's split it into cells
    text <- strsplit(text, "\n")[[1]]

    # next extract alt descriptions
    type <- html_nodes(x, "img") %>% html_attr("alt")
    # if there is more than one, collapse them into one string,
    # removing empty ones
    type <- paste0(type[type != ""], collapse = ", ")

    # combine text and alt into a vector
    out <- c(text, type[type != ""])
    # transform it to a data frame
    tibble(
      Nr = out[1],
      Pokemon   = out[2],
      Typen = out[3]
    )
  }) %>% 
  slice(-1)
#> # A tibble: 912 × 3
#>    Nr    Pokemon    Typen        
#>    <chr> <chr>      <chr>        
#>  1 001    Bisasam   Pflanze, Gift
#>  2 002    Bisaknosp Pflanze, Gift
#>  3 003    Bisaflor  Pflanze, Gift
#>  4 004    Glumanda  Feuer        
#>  5 005    Glutexo   Feuer        
#>  6 006    Glurak    Feuer, Flug  
#>  7 007    Schiggy   Wasser       
#>  8 008    Schillok  Wasser       
#>  9 009    Turtok    Wasser       
#> 10 010    Raupy     Kaefer       
#> # … with 902 more rows

Created on 2022-03-25 by the reprex package (v2.0.1)
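
If separate type1 and type2 columns are wanted, as in the question, the collapsed Typen string can be split afterwards. This is a sketch under assumptions: the small tibble stands in for the scraped result, and the names scraped, type1 and type2 are chosen here for illustration. It uses tidyr's separate() with fill = "right" so that single-type rows get NA in the second column.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for the scraped result
scraped <- tibble(
  Nr      = c("001", "004"),
  Pokemon = c("Bisasam", "Glumanda"),
  Typen   = c("Pflanze, Gift", "Feuer")
)

# Split on ", "; rows with only one type are padded with NA on the right
split_types <- separate(scraped, Typen, into = c("type1", "type2"),
                        sep = ", ", fill = "right")

split_types
```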

Answer 4

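This is fairly straightforward if you use the right CSS selector lists and process the data as a list of table rows inside a map_dfr(data.frame()) call.

Inside data.frame(), you can take advantage of the fact that NA is returned when a CSS selector list does not match anything in the DOM, which ensures that all columns have the same length. Specify one selector list for each possible column entry.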

library(tidyverse)
library(rvest)

rows <- read_html("https://www.bisafans.de/pokedex/listen/numerisch.php") %>% html_elements(".table tbody tr")

df <- map_dfr(rows, ~ data.frame(
  `Nr.` = .x %>% html_element("td:first-child") %>% html_text(),
  `Pokémon` = .x %>% html_element("a") %>% html_text(),
  `Type1` = .x %>% html_element("td:last-child > a:nth-child(odd) > img") %>% html_attr("alt"),
  `Type2` = .x %>% html_element("td:last-child > a:nth-child(even) > img") %>% html_attr("alt")
))
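
The NA behaviour can be seen on a single row. This is a small sketch using a made-up row built with rvest's minimal_html(); the markup is an assumption that imitates the real table, and the row has only one type image.

```r
library(rvest)

# Hypothetical row with a single type image
row <- minimal_html('
  <table>
    <tr><td>004</td><td><a href="#">Glumanda</a></td>
        <td><a href="#"><img src="feuer.png" alt="Feuer"></a></td></tr>
  </table>') %>%
  html_element("tr")

# html_element() always returns exactly one result per input node;
# when nothing matches, it is a missing node and html_attr() gives NA
type1 <- row %>% html_element("td:last-child > a:nth-child(odd) > img") %>% html_attr("alt")
type2 <- row %>% html_element("td:last-child > a:nth-child(even) > img") %>% html_attr("alt")

type1
type2
```

This is the difference from html_elements() and html_nodes(), which return only the nodes that exist and therefore can yield columns of unequal length.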

r

web-scraping

rvest