Is there a better way to scrape a Wikipedia page in R?
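I am working with a dataset of US states, and I am trying to scrape the Wikipedia page "List of United States governors" so that I can distinguish Democratic from Republican states.
So far, my code looks like this: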
library(tidyverse)
library(dplyr)
library(tidyr)
library(readr)
library(rvest)
library(htmltab)
library(lubridate)

corona_usa_simple <- readr::read_csv("https://raw.githubusercontent.com/datasets/covid-19/master/data/us_simplified.csv")

# Drop the columns that are not needed and rename the state column to State
corona_us_states <- corona_usa_simple %>%
  select(-FIPS, -Admin2, -`Country/Region`) %>%
  rename(State = `Province/State`)

# This is the call that fails with the error described below
wiki_governors <- htmltab("https://en.wikipedia.org/wiki/List_of_United_States_governors") %>%
  rename(State = `Democratic(24) Republican(26) >> State`)
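Before I merge the two datasets, I want to rename the first column so that it is called "State" in both. But for some reason I get the error message "Can't rename columns that don't exist."
Is there a better way to scrape the wiki page so that not every column name starts with "Democratic(24) Republican(26)"?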
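You can specify the header argument in the htmltab() call. This names the columns correctly, but it leaves "Democratic(24) Republican(26)" in the first row of the data. To remove that row, use slice(-1) from dplyr.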
wiki_governors <- htmltab("https://en.wikipedia.org/wiki/List_of_United_States_governors",
                          header = 2) %>%
  slice(-1)
Resulting data:
head(wiki_governors)
  State      Governor       Party Party.1    Born
1 Alabama    Kay Ivey             Republican October 15, 1944 (age 75)
2 Alaska     Mike Dunleavy        Republican May 5, 1961 (age 59)
3 Arizona    Doug Ducey           Republican April 9, 1964 (age 56)
4 Arkansas   Asa Hutchinson       Republican December 3, 1950 (age 69)
5 California Gavin Newsom         Democratic October 10, 1967 (age 52)
6 Colorado   Jared Polis          Democratic May 12, 1975 (age 45)
  Prior public experience
1 Lieutenant Governor, Treasurer
2 Alaska Senate
3 Treasurer
4 Under Secretary of Homeland Security for Border & Transportation Security, Administrator of the Drug Enforcement Administration, U.S. House, U.S. Attorney
5 Lieutenant Governor, Mayor of San Francisco
6 U.S. House, Colorado State Board of Education
  Inauguration     End of term        Past governors
1 April 10, 2017   2023               List
2 December 3, 2018 2022               List
3 January 5, 2015  2023 (term limits) List
4 January 13, 2015 2023 (term limits) List
5 January 7, 2019  2023               List
6 January 8, 2019  2023               List
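If you would rather keep the default header handling, another option is to print the column names, strip the prefix with a regular expression, and then merge. The sketch below uses only base R and small made-up data frames (wiki_toy and states_toy are hypothetical stand-ins, and the Confirmed values are invented for illustration), so it runs without network access. The prefixed names are an assumption modelled on the pattern in the question; the names htmltab actually returns may differ in spacing or characters, which would explain why rename() could not find the column.

```r
# Toy stand-in for the scraped table. The column names are assumptions that
# mimic the "prefix >> name" pattern described in the question.
wiki_toy <- data.frame(
  a = c("Alabama", "California"),
  b = c("Republican", "Democratic"),
  stringsAsFactors = FALSE
)
names(wiki_toy) <- c(
  "Democratic(24) Republican(26) >> State",
  "Democratic(24) Republican(26) >> Party.1"
)

# Inspect the names first: rename() needs an exact match.
print(names(wiki_toy))

# Drop everything up to and including the last ">>" separator.
names(wiki_toy) <- sub("^.*>> *", "", names(wiki_toy))

# Toy stand-in for corona_us_states (values are made up).
states_toy <- data.frame(
  State = c("Alabama", "California"),
  Confirmed = c(10L, 20L),
  stringsAsFactors = FALSE
)

# Join the two tables on the shared State column.
merged <- merge(states_toy, wiki_toy, by = "State")
print(merged)
```

With the real data, the same sub() call would be applied to names(wiki_governors) before joining to corona_us_states, once names() has confirmed what the prefix actually looks like.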