Is there a better way to scrape a Wikipedia page in R?

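I am working with a dataset of US states, and I am now trying to scrape the Wikipedia page "List of United States governors" so that I can distinguish Democratic from Republican states.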

So far, my code looks like this:

library(tidyverse)
library(dplyr)
library(tidyr)
library(readr)
library(rvest)
library(htmltab)
library(lubridate)

corona_usa_simple <- readr::read_csv("https://raw.githubusercontent.com/datasets/covid-19/master/data/us_simplified.csv")

corona_us_states <- corona_usa_simple %>%
  select(-FIPS, -Admin2, -`Country/Region`) %>%
  rename(State = `Province/State`)

wiki_governors <- htmltab("https://en.wikipedia.org/wiki/List_of_United_States_governors") %>% rename(State=`Democratic(24)  Republican(26) >> State`)

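Before merging the datasets, I want to rename the first column so that it is called "State" in both. But for some reason I get the error message "Can't rename columns that don't exist." Is there a better way to scrape the wiki page so that not every column name starts with "Democratic(24) Republican(26)"?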

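You can specify the header row in the htmltab() call. This names the columns correctly, but it leaves "Democratic(24) Republican(26)" in the first row of the data. To remove that row, use slice(-1) from dplyr: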

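# header = 2 takes the column names from the second header row;
# slice(-1) then drops the leftover "Democratic(24) Republican(26)" row.
wiki_governors <- htmltab("https://en.wikipedia.org/wiki/List_of_United_States_governors",
                          header = 2) %>%
  slice(-1)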

The resulting data:

head(wiki_governors)

       State       Governor Party    Party.1                       Born
1    Alabama       Kay Ivey      Republican October 15, 1944 (age 75)
2     Alaska  Mike Dunleavy      Republican      May 5, 1961 (age 59)
3    Arizona     Doug Ducey      Republican    April 9, 1964 (age 56)
4   Arkansas Asa Hutchinson      Republican December 3, 1950 (age 69)
5 California   Gavin Newsom      Democratic October 10, 1967 (age 52)
6   Colorado    Jared Polis      Democratic     May 12, 1975 (age 45)
                                                                                                                                     Prior public experience
1                                                                                                                             Lieutenant Governor, Treasurer
2                                                                                                                                              Alaska Senate
3                                                                                                                                                  Treasurer
4 Under Secretary of Homeland Security for Border & Transportation Security, Administrator of the Drug Enforcement Administration, U.S. House, U.S. Attorney
5                                                                                                                Lieutenant Governor, Mayor of San Francisco
6                                                                                                              U.S. House, Colorado State Board of Education
      Inauguration        End of term Past governors
1   April 10, 2017               2023           List
2 December 3, 2018               2022           List
3  January 5, 2015 2023 (term limits)           List
4 January 13, 2015 2023 (term limits)           List
5  January 7, 2019               2023           List
6  January 8, 2019               2023           List
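
With a clean State column in place, the merge the question is aiming for could look roughly like the sketch below. It runs offline on small made-up data frames rather than the scraped table: the Confirmed column and its values are hypothetical, and it assumes the party name ends up in Party.1, as the printed output above suggests.

```r
library(dplyr)

# Toy stand-ins for the two data sets; the values are illustrative only.
corona_us_states <- tibble(
  State     = c("Alabama", "Alaska", "California"),
  Confirmed = c(100, 50, 300)   # hypothetical column, not from the source
)
wiki_governors <- tibble(
  State    = c("Alabama", "Alaska", "California"),
  Governor = c("Kay Ivey", "Mike Dunleavy", "Gavin Newsom"),
  Party.1  = c("Republican", "Republican", "Democratic")
)

# Keep only the columns needed for the merge, renaming Party.1 to Party,
# then join on the shared State column.
states_with_party <- corona_us_states %>%
  left_join(select(wiki_governors, State, Party = Party.1), by = "State")

print(states_with_party)
```

A left join keeps every row of the state data even if a name fails to match; checking for NA in Party afterwards is a quick way to spot mismatched state names.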