Rvest html_table error - Error in out[j + k, ] : subscript out of bounds
I'm getting an error message that I can't make sense of. My code:
library(rvest)
url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(url)
testdata <- leg %>%
  html_nodes('table') %>%
  .[6] %>%
  html_table()
The response I get is:
Error in out[j + k, ] : subscript out of bounds
When I swap html_table for html_text, I don't get an error. Any idea what I'm doing wrong?
Why not target the table better?
library(rvest)
wp_url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(wp_url)
html_node(leg, xpath=".//table[contains(., 'District')]") %>%
  html_table()
## Position Position Name Party District
## 1 Lieutenant Governor Gavin Newsom Democratic
## 2 President pro tempore Kevin de León Democratic 24th–Los Angeles
## 3 Majority leader Bill Monning Democratic 17th–Carmel
## 4 Majority whip Nancy Skinner Democratic 9th–Berkeley
## 5 Majority caucus chair Connie Leyva Democratic 20th–Chino
## 6 Majority caucus vice chair Mike McGuire Democratic 2nd–Healdsburg
## 7 Minority leader Patricia Bates Republican 36th–Laguna Niguel
## 8 Minority caucus chair Jim Nielsen Republican 4th–Gerber
## 9 Minority whip Ted Gaines Republican 1st–El Dorado Hills
## 10 Secretary Secretary Daniel Alvarez Daniel Alvarez Daniel Alvarez
## 11 Sergeant-at-Arms Sergeant-at-Arms Debbie Manning Debbie Manning Debbie Manning
## 12 Chaplain Chaplain Sister Michelle Gorman Sister Michelle Gorman Sister Michelle Gorman
Ugh! Wrong table. It's still unwise to rely on a bare numeric index like that, though. We can still do a better job of targeting the table you want:
library(rvest)
library(purrr)
wp_url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
leg <- read_html(wp_url)
target_table <- html_node(leg, xpath=".//span[@id='Members']/../following-sibling::table")
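However, rvest::html_table() raises the error on that table, and you should definitely file a bug report for it on the GH page.
The htmltab pkg used in the other answer looks handy (and feel free to accept that answer, since it's shorter and it works).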
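To see why anchoring on the section heading is more robust than matching on text, here is a minimal sketch on a made-up page. The toy HTML, the `Leadership` id and the `first_header()` helper are assumptions for illustration only; the two XPath expressions are the ones used above.

```r
library(rvest)

# Hypothetical page: both tables contain the word "District".
toy_page <- read_html('
<body>
  <h2><span id="Leadership">Leadership</span></h2>
  <table>
    <tr><th>Position</th><th>District</th></tr>
    <tr><td>Majority whip</td><td>9th</td></tr>
  </table>
  <h2><span id="Members">Members</span></h2>
  <table>
    <tr><th>District</th><th>Name</th></tr>
    <tr><td>1</td><td>Ted Gaines</td></tr>
  </table>
</body>')

# Text match: html_node() returns the first table that contains "District".
by_text <- html_node(toy_page, xpath = ".//table[contains(., 'District')]")

# Heading match: the first table that follows the heading holding span#Members.
by_heading <- html_node(
  toy_page,
  xpath = ".//span[@id='Members']/../following-sibling::table"
)

# Illustrative helper: text of the first header cell of a table node.
first_header <- function(tbl) html_text(html_node(tbl, "th"))

first_header(by_text)     # "Position" (the leadership table)
first_header(by_heading)  # "District" (the members table)
```

The heading-based expression keeps working if tables are added or reordered earlier on the page, as long as the `Members` id stays in place.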
We'll do this the old-fashioned way, but we'll need a helper function to make nicer column names:
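mcga <- function(x) {
  x <- tolower(x)                                  # lower-case everything
  x <- gsub("[[:punct:][:space:]]+", "_", x)       # punctuation/whitespace runs -> "_"
  x <- gsub("_+", "_", x)                          # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                      # trim leading/trailing underscore
  make.unique(x, sep = "_")                        # de-duplicate names
}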
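A quick check of what the helper produces. The function is repeated here only so the snippet runs on its own, and the input vector is made up to exercise punctuation and duplicate names.

```r
# Same helper as above, repeated so this snippet is self-contained.
mcga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  make.unique(x, sep = "_")
}

# Hypothetical header texts, including a duplicate and trailing punctuation.
raw_names <- c("District", "Name", "Term-limited?", "Position", "Position")
clean_names <- mcga(raw_names)
clean_names
# "district" "name" "term_limited" "position" "position_1"
```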
Now, we extract the header row and the data rows:
header_row <- html_node(target_table, xpath=".//tr[th]")
data_rows <- html_nodes(target_table, xpath=".//tr[td]")
We look at the header row and see there's an evil colspan in there. We'll use this knowledge later.
html_children(header_row)
## {xml_nodeset (6)}
## [1] <th scope="col" width="30" colspan="2">District</th>
## [2] <th scope="col" width="170">Name</th>
## [3] <th scope="col" width="70">Party</th>
## [4] <th scope="col" width="130">Residence</th>
## [5] <th scope="col" width="60">Term-limited?</th>
## [6] <th scope="col" width="370">Notes</th>
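The practical effect of that colspan is that each data row carries one more cell than the header has names. A minimal sketch on a made-up table shows how to spot this by counting cells per row; the toy HTML is an assumption and is not claimed to reproduce the html_table() error itself.

```r
library(rvest)
library(purrr)

# Hypothetical table: "District" spans two columns, so each data row
# has an extra leading cell (a colour swatch on the real page).
toy_table <- html_node(read_html('
<table>
  <tr><th colspan="2">District</th><th>Name</th><th>Party</th></tr>
  <tr><td></td><td>1</td><td>Ted Gaines</td><td>Republican</td></tr>
  <tr><td></td><td>2</td><td>Mike McGuire</td><td>Democratic</td></tr>
</table>'), "table")

rows <- html_nodes(toy_table, "tr")

# Number of direct child cells in each row: header first, then data rows.
cells_per_row <- map_int(rows, ~ length(html_children(.x)))
cells_per_row
# 3 4 4

# The colspan attribute of each header cell (NA where it is absent).
header_colspans <- html_attr(html_children(rows[[1]]), "colspan")
header_colspans
# "2" NA NA
```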
Get the column names, and tidy them up:
html_children(header_row) %>%
  html_text() %>%
  tolower() %>%
  mcga() -> col_names
Now, iterate over the rows, extract the values, drop the extra first value and turn the whole thing into a data frame:
map_df(data_rows, ~{
  kid_txt <- html_children(.x) %>% html_text()
  as.list(setNames(kid_txt[-1], col_names))
})
## # A tibble: 40 x 6
## district name party residence term_limited notes
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Ted Gaines Republican El Dorado Hills
## 2 2 Mike McGuire Democratic Healdsburg
## 3 3 Bill Dodd Democratic Napa
## 4 4 Jim Nielsen Republican Gerber
## 5 5 Cathleen Galgiani Democratic Stockton
## 6 6 Richard Pan Democratic Sacramento
## 7 7 Steve Glazer Democratic Orinda
## 8 8 Tom Berryhill Republican Twain Harte Yes
## 9 9 Nancy Skinner Democratic Berkeley
## 10 10 Bob Wieckowski Democratic Fremont
## # ... with 30 more rows
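The same steps can be tried offline on a made-up table. This is a sketch under stated assumptions: the toy HTML is hypothetical, the column names are simply lower-cased instead of going through `mcga()`, and `map_df()` needs dplyr installed, as it does above.

```r
library(rvest)
library(purrr)

# Hypothetical table with the same shape problem: a two-column header cell
# and an extra leading cell in every data row.
toy_table <- html_node(read_html('
<table>
  <tr><th colspan="2">District</th><th>Name</th><th>Party</th></tr>
  <tr><td></td><td>1</td><td>Ted Gaines</td><td>Republican</td></tr>
  <tr><td></td><td>2</td><td>Mike McGuire</td><td>Democratic</td></tr>
</table>'), "table")

header_row <- html_node(toy_table, xpath = ".//tr[th]")
data_rows  <- html_nodes(toy_table, xpath = ".//tr[td]")

# One name per header cell.
col_names <- tolower(html_text(html_children(header_row)))

# For each row: take the cell texts, drop the extra first cell,
# name the rest and let map_df() bind the rows into one data frame.
members <- map_df(data_rows, ~{
  kid_txt <- html_text(html_children(.x))
  as.list(setNames(kid_txt[-1], col_names))
})

members
```

All columns come back as character, so a column such as `district` would still need `as.integer()` if numbers are wanted.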
library(htmltab)
library(dplyr)
library(tidyr)
url <- "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
url %>%
  htmltab(6, rm_nodata_cols = FALSE) %>%                       # sixth table on the page
  .[, -1] %>%                                                  # drop the extra first column
  replace_na(list(Notes = "", "Term-limited?" = "")) %>%       # blank out missing values
  `rownames<-`(seq_len(nrow(.)))                               # renumber rows from 1
The output is:
District Name Party Residence Term-limited? Notes
1 1 Ted Gaines Republican El Dorado Hills
2 2 Mike McGuire Democratic Healdsburg
3 3 Bill Dodd Democratic Napa
4 4 Jim Nielsen Republican Gerber
5 5 Cathleen Galgiani Democratic Stockton
6 6 Richard Pan Democratic Sacramento
...