使用 RVest 创建 HTML Table 然后使用操作和清理进入 DF

Question

我想抓取以下网页：

https://kubears.com/sports/football/stats/2021/assumption/boxscore/11837

... 具体来说，是顶部菜单中的“逐个播放”选项卡。获取信息非常简单：

library(tidyverse)
library(rvest)

url <- "https://kubears.com/sports/football/stats/2021/assumption/boxscore/11837"

page <- rvest::read_html(url)

page %>%
  html_table() %>%
  pluck(27)

... 结果是：

# A tibble: 7 x 2
  `Assumption at 15:00` `Assumption at 15:00`                                                    
  <chr>                 <chr>                                                                    
1 Down & Distance       Play                                                                     
2 Assumption at 15:00   Assumption at 15:00                                                      
3 1st and 10 at ASM19   Assumption drive start at 15:00.                                         
4 1st and 10 at ASM19   Turner,Easton rush for 2 yards loss to the ASM17 (Justice,Amani).        
5 2nd and 12 at ASM17   Exum-Strong,Khaleed rush for 9 yards gain to the ASM26 (Justice,Amani).  
6 3rd and 3 at ASM26    Turner,Easton pass incomplete to Collier,Bailey broken up by Meyers,Ryan.
7 4th and 3 at ASM26    Bertolazzo,Gabriel punt 46 yards to the KTZ28.

这就是我迷路的地方。我想获取这些信息并将其处理到不同的列中。例如，这是来自 Excel 的屏幕截图，其中我列出了我希望完成的输出看起来像的内容：

如您所见，我正在从每个单独的游戏中获取信息并将这些信息翻译成特定的栏目。是抢戏吗？是传球吗？获得了多少码？向下、距离、yardline_100等

然后在整个游戏过程中执行该过程。

任何关于如何启动流程的建议 and/or 都将不胜感激。说到 R，抓取当然不是我的核心优势。

Answer 1

这是一种使用 tidyverse 中的函数实现结果的方法。有很多不同的方法可以得到相同的结果，这只是一种方法。代码结构分为三个主要部分：首先，通过绑定多个列表的行构建一个大数据框，第二个删除原始数据框中无用的行，第三个创建所有变量。

tab 数据框也与您的 page 原始输入略有不同，请参阅数据和函数部分中的代码。我基本上更改了列名，使它们不相同，并将它们重命名为 col1 和 col2.

实际使用的只是几个不同的函数。我创建了 extract_digit，它从字符串中提取第 n 次出现的数字。 str_extract 和 str_match 从字符串中提取指定的模式，而 str_detects 仅检测（而 returns 逻辑，TRUE 或 FALSE）。 word 从字符串中获取第 n 个单词。

library(tidyverse)

tab %>% 

  # Bind rows of the list to make a big data frame
  bind_rows() %>% 

  # Define the quarter
  mutate(quarter = str_extract(paste(col1, col2), "\w+(?=\s+quarter)")) %>% 
  fill(quarter) %>% 
  
  # Keep only rows that starts with a number
  filter(str_detect(col1, '^[0-9]')) %>% 
  # Create some variables
  mutate(drive_start = str_extract(col2, '[0-9][0-9]:[0-9][0-9]')) %>% 
  fill(drive_start) %>% 

  # Remove duplicated rows in the first column, keeping the ones with the longest string
  group_by(col1) %>% 
  slice_max(nchar(col2)) %>% 
  ungroup() %>% 

  # ordering the rows
  arrange(quarter, desc(drive_start)) %>% 

  # Mutating part, creation of new variables
  mutate(posteam = ifelse(str_detect(col1, "ASM"), "ASM", "KTZ"), 
         defteam = ifelse(str_detect(col1, "ASM"), "KTZ", "ASM"),
         yardline_100 = 100 - extract_digit(col1, 3),
         down = extract_digit(col1, 1),
         ydstogo = extract_digit(col1, 2),
         ydsgained = -c(diff(ydstogo), 0),
         penalty = str_detect(col2, "PENALTY"),
         end = str_detect(col2, "End"),
         play_type = case_when(penalty ~ "penalty", 
                               end ~ "end",
                               T ~ word(col2, 2)),
         rush = +(play_type == "rush"),
         pass = +(play_type == "pass"),
         special = +(!play_type %in% c("rush", "pass")),
         passer = ifelse(pass == 1, word(col2, 1), NA),
         rusher = ifelse(rush == 1, word(col2, 1), NA), 
         tackle = str_match(col2, "(?<=\().+?(?=\))")[,1], #Get word between parentheses
         complete_pass = case_when(str_detect(col2, "incomplete") & pass == 1 ~ 0,
                                   pass == 1 ~ 1,
                                   TRUE ~ NA_real_),
         pass_breakup = word(col2, 2, sep = "broken up by "),
         punt = +(play_type == "punt"),
         punter = ifelse(punt == 1, word(col2, 1), NA)) %>% 
  select(-c(col1, col2, penalty, end))

输出

# A tibble: 128 x 19
   quarter drive_start posteam defteam yardline_100  down ydstogo ydsgained play_type  rush  pass special passer        rusher              tackle                     complete_pass pass_breakup  punt punter
   <chr>   <chr>       <chr>   <chr>          <dbl> <dbl>   <dbl>     <dbl> <chr>     <int> <int>   <int> <chr>         <chr>               <chr>                              <dbl> <chr>        <int> <chr> 
 1 1st     15:00       ASM     KTZ               81     1      10        -2 rush          1     0       0 NA            Turner,Easton       Justice,Amani                         NA NA               0 NA    
 2 1st     15:00       ASM     KTZ               83     2      12         9 rush          1     0       0 NA            Exum-Strong,Khaleed Justice,Amani                         NA NA               0 NA    
 3 1st     15:00       ASM     KTZ               74     3       3         0 pass          0     1       0 Turner,Easton NA                  NA                                     0 Meyers,Ryan.     0 NA    
 4 1st     15:00       ASM     KTZ               74     4       3        -7 punt          0     0       1 NA            NA                  NA                                    NA NA               1 Berto~
 5 1st     13:30       KTZ     ASM               72     1      10         0 rush          1     0       0 NA            Nickel,Eric         Borguet,Keelan                        NA NA               0 NA    
 6 1st     13:30       KTZ     ASM               61     1      10        -7 rush          1     0       0 NA            Davis,Jordan        Allen,Edward                          NA NA               0 NA    
 7 1st     13:30       KTZ     ASM               68     2      17        10 pass          0     1       0 Nickel,Eric   NA                  Malm,Daniel                            1 NA               0 NA    
 8 1st     13:30       KTZ     ASM               58     2       7        -1 penalty       0     0       1 NA            NA                  Omokaro,Akugbe                        NA NA               0 NA    
 9 1st     13:30       KTZ     ASM               70     2       8         6 rush          1     0       0 NA            Davis,Jordan        Omokaro,Akugbe; Wright,Tr~            NA NA               0 NA    
10 1st     13:30       KTZ     ASM               53     3       2         0 pass          0     1       0 Nickel,Eric   NA                  NA                                     0 NA               0 NA    
# ... with 118 more rows

数据与功能

extract_digit <- function(x, n) as.numeric(sapply(str_extract_all(x, "[0-9]+"), `[`, n))
# This function is a wrapper to get the nth occurrence of a number in a string.

tab <- page %>%
  html_table() %>%
  .[26:47] %>% 
  map(~ .x %>% 
        tibble(.name_repair = "universal") %>% 
        rename_with(~ str_c("col", 1:2)))

[[1]]
# A tibble: 4 x 1
  col1                                                                                  
  <chr>                                                                                 
1 Play                                                                                  
2 Kutztown wins toss and defers; ASM will receive; KTZ will defend East end-zone.       
3 Start of 1st quarter, clock 15:00.                                                    
4 Krcic,Dean kickoff 65 yards to the ASM00 Easton,Brevin return 19 yards to the ASM19 (~

[[2]]
# A tibble: 7 x 2
  col1                col2                                                              
  <chr>               <chr>                                                             
1 Down & Distance     Play                                                              
2 Assumption at 15:00 Assumption at 15:00                                               
3 1st and 10 at ASM19 Assumption drive start at 15:00.                                  
4 1st and 10 at ASM19 Turner,Easton rush for 2 yards loss to the ASM17 (Justice,Amani). 
5 2nd and 12 at ASM17 Exum-Strong,Khaleed rush for 9 yards gain to the ASM26 (Justice,A~
6 3rd and 3 at ASM26  Turner,Easton pass incomplete to Collier,Bailey broken up by Meye~
7 4th and 3 at ASM26  Bertolazzo,Gabriel punt 46 yards to the KTZ28.
# ...

Answer 2

我尝试创建了一个 extract_plays 函数，该函数基本上通过使用一系列 stringr::str_detect()、stringr::str_extract() 和 if_else() 来解析 Play 列函数。

棘手的是每个季度开始时的表格有些不一致，需要特别注意，否则代码应该相当self-explanatory。

我不是美式足球的粉丝，所以请检查我所做的一些假设。

library(tidyverse)
library(rvest)

url <- "https://kubears.com/sports/football/stats/2021/assumption/boxscore/11837"

page <- rvest::read_html(url)

# set index for teams
teams <- tibble(team = c("Assumption", "Kutztown"),
                    posteam = c("ASM", "KTZ"),
                    defteam = c("KTZ","ASM"))

# grab the play pages from table
  plays <- page %>% 
    html_table() %>% 
    .[27:47] %>% 
    # use colnames to extract teams later
    map(~rbind(colnames(.x),.x))
## note that  7th and 17th elements of plays are the 2nd and 4th quarter starts - different format
## 11th element is 2nd half start, incompatible table (not included in result)
  
# create plays extraction function
  extract_plays <- function(df){
    df <-  df %>% 
      set_names(c("Downs","Play")) %>% 
      mutate(team = str_extract(first(Downs), "^\S+"),
             drive_start = str_extract(first(Downs),"\S+$")) %>%
      filter(str_detect(Downs, "and")) %>%
      inner_join(teams)  %>% 
      extract(Downs,
              into = c("down","ydstogo","yardline_100"),
              regex = "(^\d..) and (\d+) at (.*)") %>%
      mutate(yardline_100 = ifelse(str_detect(yardline_100,defteam),
                                   parse_number(yardline_100),
                                   100 - parse_number(yardline_100))) %>%
      mutate(pass = +str_detect(Play, "pass"),
             rush = +str_detect(Play, "rush"),
             punt = +str_detect(Play, "punt"),
             special = ifelse(pass + rush == 0,1,0)) %>%
      mutate(play_type = case_when(str_detect(Play, "pass") ~ "pass",
                                   str_detect(Play, "rush") ~ "rush",
                                   str_detect(Play, "punt") ~ "punt",
                                   str_detect(Play, "field goal") ~ "fieldgoal",
                                   str_detect(Play, "kickoff") ~ "kickoff",
                                   TRUE ~ "other")) %>%
      mutate(passer = str_extract(Play, "(.*)(?=\spass)"),
             rusher = str_extract(Play, "(.*)(?=\srush)"),
             punter = str_extract(Play, "(.*)(?=\spunt)"),
             tackle = NA,
             tackle = ifelse(play_type == "pass",
                             str_extract(Play, "(?<=\().+?(?=\))"), tackle),
             tackle = ifelse(play_type == "rush",
                             str_extract(Play, "(?<=\().+?(?=\))"), tackle),
             yrds = ifelse(play_type == "rush", str_extract(Play, "(\d+\syards\s\w{4})"),NA),
             yrds = ifelse(play_type == "pass", str_extract(Play, "for\s\d+\syards"),yrds),
             yrdsgained = ifelse(str_detect(yrds,"loss"),-1*parse_number(yrds),parse_number(yrds)),
             complete_pass = case_when(str_detect(Play, "completed") ~ 1,
                                       str_detect(Play, "incomplete") ~ 0,
                                       TRUE ~ NA_real_),
             pass_breakup = str_extract(Play,"((?<=broken\sup\sby\s).*$)")
      ) %>%
      select(drive_start,posteam,defteam,yardline_100,down,ydstogo,play_type,pass,rush,special,passer,rusher,yrdsgained,tackle,complete_pass,pass_breakup,punt,punter,Play) 
    return(df)
  }
  
## Grab for each quarter 
Q1 <- plays %>% 
    .[1:6] %>% 
    map_df(extract_plays) %>% 
      mutate(Quarter = "1st", .before = drive_start) 
  
Q2 <-   plays %>%
    .[8:10] %>% 
    map_df(extract_plays) %>% 
      mutate(Quarter = "2nd", .before = drive_start)
    
Q3 <-     plays %>%
      .[12:16] %>% 
      map_df(extract_plays) %>% 
        mutate(Quarter = "3rd", .before = drive_start) 
      
Q4 <-      plays %>% 
        .[18:21] %>% 
        map_df(extract_plays) %>% 
        mutate(Quarter = "4th", .before = drive_start)


 ## Grab the special cases for 2nd and 4th quarter starts     
Q2.1 <- plays %>%
  .[[7]] %>%  rbind(c("Kutztown at 15:00","Play"),.) %>% 
  extract_plays() %>% 
  mutate(Quarter = "2nd", .before = drive_start)

Q4.1 <- plays %>%
  .[[17]] %>%  rbind(c("Kutztown at 15:00","Play"),.) %>% 
  extract_plays()%>% 
  mutate(Quarter = "4th", .before = drive_start)

## Add to Q2 and Q4
Q2 <- bind_rows(Q2.1,Q2)
Q4 <- bind_rows(Q4.1,Q4)

#final table
result <- bind_rows(list(Q1,Q2,Q3,Q4))
result

给予：

# A tibble: 170 × 20
   Quarter drive_start posteam defteam yardline_100 down  ydstogo play_type  pass  rush
   <chr>   <chr>       <chr>   <chr>          <dbl> <chr> <chr>   <chr>     <dbl> <dbl>
 1 1st     15:00       ASM     KTZ               81 1st   10      other         0     0
 2 1st     15:00       ASM     KTZ               81 1st   10      rush          0     1
 3 1st     15:00       ASM     KTZ               83 2nd   12      rush          0     1
 4 1st     15:00       ASM     KTZ               74 3rd   3       pass          1     0
 5 1st     15:00       ASM     KTZ               74 4th   3       punt          0     0
 6 1st     13:30       KTZ     ASM               72 1st   10      other         0     0
 7 1st     13:30       KTZ     ASM               72 1st   10      rush          0     1
 8 1st     13:30       KTZ     ASM               70 2nd   8       rush          0     1
 9 1st     13:30       KTZ     ASM               66 3rd   4       rush          0     1
10 1st     13:30       KTZ     ASM               61 1st   10      rush          0     1
# … with 160 more rows, and 10 more variables: special <dbl>, passer <chr>, rusher <chr>,
#   yrdsgained <dbl>, tackle <chr>, complete_pass <chr>, pass_breakup <chr>, punt <dbl>,
#   punter <chr>, Play <chr>

使用 RVest 创建 HTML Table 然后使用操作和清理进入 DF

Using RVest to Create HTML Table And Then Using Manipulating and Cleaning into DF

r

web-scraping

tidyr

rvest