rvest table scraping including links
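I'd like to scrape some table data from Wikipedia. Some of the table columns contain links to other articles, and I'd like to keep those links. I tried this approach, but it did not preserve the URLs. I looked through the description of the html_table() function and found no option for including them. Is there another package or method that can do this?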
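library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
# read_html() is used here because html() was deprecated in later rvest releases
simp <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table[3]') %>%
  html_table()
# html_table() returns a list of data frames; keep the first one
simp <- simp[[1]]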
Try this:
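library(XML)
library(httr)
url <- "http://en.wikipedia.org/wiki/List_of_The_Simpsons_episodes"
# Fetch the page as text and parse it with the XML package, so that
# readHTMLTable() receives an XML-package document (newer httr versions
# return an xml2 document from content() by default).
doc <- htmlParse(content(GET(url), as = "text"), asText = TRUE)
# Cell handler: "URL | text" when the cell contains a link, otherwise the cell text
getHrefs <- function(node, encoding) {
  x <- xmlChildren(node)$a
  if (!is.null(x)) {
    paste0("http://", parseURI(url)$server, xmlGetAttr(x, "href"), " | ", xmlValue(x))
  } else {
    xmlValue(xmlChildren(node)$text)
  }
}
tab <- readHTMLTable(doc, which = 3, elFun = getHrefs)
head(tab[, 1:4])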
# No. in\nseries No. in\nseason Title Directed by
# 1 1 1 http://en.wikipedia.org/wiki/Simpsons_Roasting_on_an_Open_Fire | Simpsons Roasting on an Open Fire http://en.wikipedia.org/wiki/David_Silverman_(animator) | David Silverman
# 2 2 2 http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius David Silverman
# 3 3 3 http://en.wikipedia.org/wiki/Homer%27s_Odyssey_(The_Simpsons) | Homer's Odyssey http://en.wikipedia.org/wiki/Wes_Archer | Wes Archer
# 4 4 4 http://en.wikipedia.org/wiki/There%27s_No_Disgrace_Like_Home | There's No Disgrace Like Home http://en.wikipedia.org/wiki/Gregg_Vanzo | Gregg Vanzo
# 5 5 5 http://en.wikipedia.org/wiki/Bart_the_General | Bart the General David Silverman
# 6 6 6 http://en.wikipedia.org/wiki/Moaning_Lisa | Moaning Lisa Wes Archer
The URLs are preserved and separated from the link text by a pipe (|), so you can split them with strsplit(as.character(tab[, 3]), split = " | ", fixed = TRUE).
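As a minimal sketch of that splitting step (not part of the original answer's code), the snippet below turns one "URL | text" column into separate url and text columns. The helper name split_link is hypothetical, the sample vectors are copied from rows 2 and 3 of the output above, and the sketch assumes the link text never contains " | " itself.

```r
# Sample cells in the "URL | text" shape shown above (rows 2 and 3 of the output).
# Cells without a link hold only the text.
title <- c(
  "http://en.wikipedia.org/wiki/Bart_the_Genius | Bart the Genius",
  "http://en.wikipedia.org/wiki/Homer%27s_Odyssey_(The_Simpsons) | Homer's Odyssey"
)
director <- c(
  "David Silverman",
  "http://en.wikipedia.org/wiki/Wes_Archer | Wes Archer"
)

# Hypothetical helper: split one column into a data frame with `url` and `text`.
# Assumes at most one " | " separator per cell.
split_link <- function(x) {
  parts <- strsplit(as.character(x), split = " | ", fixed = TRUE)
  has_url <- lengths(parts) == 2
  data.frame(
    # First piece is the URL when a separator was found, otherwise NA
    url = ifelse(has_url, vapply(parts, `[`, character(1), 1), NA_character_),
    # Last piece is always the visible text (empty string for empty cells)
    text = vapply(
      parts,
      function(p) if (length(p) > 0) p[length(p)] else "",
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

title_parts <- split_link(title)
director_parts <- split_link(director)
```

On the real table, the same helper could be applied as split_link(tab[, 3]); cells that had no link end up with NA in url and the plain cell text in text.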