Web scrape uneven non table content - problem when multiple values for a heading

Question

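I am trying to scrape basic player information for cricketers from their profiles on the cricinfo website. An example player profile page is given here: https://www.espncricinfo.com/player/shaun-marsh-6683

Ultimately, I want to write a function in R that extracts the information at the top of the Overview tab (Full Name, Born, Age, etc.) and puts it into a data frame in R. I then have another function that lets me do this for multiple players of interest.

However, there are two main problems. The first is that not all players have the same categories of information on their Overview page. I therefore need to import the category headings (e.g. Full Name, Born, Age) along with their corresponding values for each player. I have done this in R using rvest, with the following code: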

player_info <- content %>%
  html_nodes(".player_overview-grid") %>%
  html_nodes(".player-card-description.gray-900") %>%
  html_text()

player_cats <- content %>%
  html_nodes(".player_overview-grid") %>%
  html_nodes(".player-card-heading") %>%
  html_text()

newplayer <- data.frame(player_cats, player_info)

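This gives the desired result for most players, but I have run into a second problem that I cannot figure out how to solve. Some players have two values under a given heading; for example, in the link given above, the player has two relations (a brother and a father), which means the player_cats and player_info vectors have different lengths.

Could someone please help me solve this? I think I need to somehow extract each category and its values as a pair rather than separately, if that makes sense. If there are multiple entries, I am happy either to extract only the first value in the category or to include the category heading multiple times in the final R data frame. Either is fine.

Sorry if this is a simple question; I am very new to this. Many thanks.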

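Edit:

Suppose I apply the function to this player's page: https://www.espncricinfo.com/player/wes-agar-959833. The output is as desired, because each category has only one entry. That is, it gives me the data frame shown in image 1 (a data frame of the information categories and their values for this player).

However, a problem arises when I apply the function to the original profile listed above: https://www.espncricinfo.com/player/shaun-marsh-6683. I get an error because there are 9 categories but 10 entries, so rbind cannot be used. See images 2, 3 and 4. I need a way to capture which category each value belongs to, so that I can repeat the category header in the R data frame. I would like to see either a data frame with 10 rows, with 'relations' repeated in the first column, or a data frame with 9 rows, with 'relations' appearing once and the first value, "GR Marsh", in the right-hand column.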

Answer 1

One workaround is to use html_text2 with an xpath for each category:

library(rvest)
library(dplyr)
library(stringr)  # provides str_split()

url = "https://www.espncricinfo.com/player/shaun-marsh-6683"

# read the page once, rather than on every loop iteration
page = read_html(url)

# create an empty object to collect the results
df = vector()

for (i in 1:9) {
  # build the xpath for each of the nine categories
  nod = paste0('//*[@id="main-container"]/div[1]/div/div[2]/div/div[2]/div[2]/div[1]/div/div[1]/div[', i, ']')
  df1 = page %>%
    html_nodes(xpath = nod) %>%
    html_text2()
  # split the text on newlines so the heading and its value(s) become columns
  df1 = do.call(rbind, str_split(df1, "\n"))
  df = rbind.data.frame(df, df1)
}

                 V1 V2                                         V3
    1     Full Name                            Shaun Edward Marsh
    2          Born    July 09, 1983, Narrogin, Western Australia
    3           Age                                      38y 147d
    4     Nicknames                                           Sos
    5 Batting Style                                 Left hand bat
    6 Bowling Style                        Slow left arm orthodox
    7  Playing Role                              Top order batter
    8        Height                                        1.84 m
    9     relations          GR Marsh (father),MR Marsh (brother)
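
The pairing the question asks for (each category kept together with its own values) can also be sketched without hard-coding the number of categories or a long xpath. This is only a sketch under an assumption: that every direct child div of .player_overview-grid wraps exactly one .player-card-heading and one or more .player-card-description elements, which is what the xpath loop above suggests but which I have not verified against the live page. The function name extract_pairs and the stand-in HTML are made up for illustration; swap in read_html(url) for a real profile.

```r
library(rvest)

# Stand-in HTML mimicking the ASSUMED layout: each direct child of the grid
# holds one heading and one or more values. Not copied from the real site.
page <- minimal_html('
  <div class="player_overview-grid">
    <div>
      <p class="player-card-heading">Full Name</p>
      <h5 class="player-card-description gray-900">Shaun Edward Marsh</h5>
    </div>
    <div>
      <p class="player-card-heading">Relations</p>
      <h5 class="player-card-description gray-900">GR Marsh (father)</h5>
      <h5 class="player-card-description gray-900">MR Marsh (brother)</h5>
    </div>
  </div>')

# Hypothetical helper: returns one row per value, repeating the heading
# whenever a category has several values.
extract_pairs <- function(page) {
  cells <- html_elements(page, ".player_overview-grid > div")
  rows <- lapply(cells, function(cell) {
    heading <- html_text2(html_element(cell, ".player-card-heading"))
    values  <- html_text2(html_elements(cell, ".player-card-description"))
    # keep the category even if it has no value
    if (length(values) == 0) values <- NA_character_
    data.frame(category = heading, value = values, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Option 1: heading repeated for every value
all_values <- extract_pairs(page)

# Option 2: only the first value per heading
first_only <- all_values[!duplicated(all_values$category), ]
```

Because the heading and values are read from the same cell, the two columns can never end up with different lengths, and players with different sets of categories are handled the same way.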


r

web-scraping

rvest