Submit URLs from a data frame column using rvest
I have a data frame named dogs that looks like this:
url
https://en.wikipedia.org/wiki/Dog
https://en.wikipedia.org/wiki/Dingo
https://en.wikipedia.org/wiki/Canis_lupus_dingo
I want to pass every URL to rvest, but I am not sure how to do it.
I tried this:
dogstext <-html(dogs$url) %>%
html_nodes("p:nth-child(4)") %>%
html_text()
but I got this error:
Error in UseMethod("parse") :
no applicable method for 'parse' applied to an object of class "factor"
As the error says, the factor column has to be converted to character before it can be parsed:
dogs$url <- as.character(dogs$url)
Then run your code.
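To make the factor issue concrete, here is a minimal sketch in base R. The two-row data frame is a made-up stand-in for the one in the question, and `stringsAsFactors = TRUE` is set explicitly because R 4.0.0 and later no longer create factors by default.

```r
# Hypothetical data frame mirroring the one in the question.
# stringsAsFactors = TRUE reproduces the pre-R 4.0 default behaviour.
dogs <- data.frame(
  url = c("https://en.wikipedia.org/wiki/Dog",
          "https://en.wikipedia.org/wiki/Dingo"),
  stringsAsFactors = TRUE
)
class(dogs$url)   # "factor"

# Option 1: convert the existing column in place.
dogs$url <- as.character(dogs$url)
class(dogs$url)   # "character"

# Option 2: avoid creating the factor in the first place.
dogs2 <- data.frame(
  url = c("https://en.wikipedia.org/wiki/Dog",
          "https://en.wikipedia.org/wiki/Dingo"),
  stringsAsFactors = FALSE
)
```

Either option leaves a plain character column that can be handed to the parser one element at a time.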
Update, applying the call to each URL in turn:
> dog <- data.frame(url = c("https://en.wikipedia.org/wiki/Dog",
+                           "https://en.wikipedia.org/wiki/Dingo",
+                           "https://en.wikipedia.org/wiki/Canis_lupus_dingo"))
> str(dog)
'data.frame': 3 obs. of 1 variable:
$ url: Factor w/ 3 levels "https://en.wikipedia.org/wiki/Canis_lupus_dingo",..: 3 2 1
> lapply(as.character(dog$url), function(i) read_html(i) %>%
+   html_nodes("p:nth-child(4)") %>%
+   html_text())
[[1]]
[1] "The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated canid which has been selectively bred for millennia for various behaviors, sensory capabilities, and physical attributes.[2] The global dog population is estimated to between 700 million[3] to over one billion, thus making the dog the most abundant member of order Carnivora.[4]"
[[2]]
[1] "The dingo's habitat ranges from deserts to grasslands and the edges of forests. Dingoes will normally make their dens in deserted rabbit holes and hollow logs close to an essential supply of water."
[[3]]
character(0)
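The `character(0)` for the third URL means the selector matched nothing on that page: `p:nth-child(4)` only selects a `<p>` that is the fourth child of its parent. The sketch below illustrates this offline with two made-up inline HTML snippets in place of real pages. It assumes a recent rvest, where `read_html()` replaces the older `html()`.

```r
library(rvest)

# Hypothetical page whose <div> has only three children:
# no <p> can be a fourth child, so the selector matches nothing.
short_page <- read_html(
  "<html><body><div><p>one</p><p>two</p><p>three</p></div></body></html>"
)

# Hypothetical page with a fourth <p> inside the same parent.
long_page <- read_html(
  "<html><body><div><p>one</p><p>two</p><p>three</p><p>four</p></div></body></html>"
)

no_match <- short_page %>% html_nodes("p:nth-child(4)") %>% html_text()
hit      <- long_page  %>% html_nodes("p:nth-child(4)") %>% html_text()

length(no_match)  # 0, i.e. character(0)
hit               # "four"
```

Because the position of the wanted paragraph differs from page to page, a positional selector like this one may need adjusting per site.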
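You can also stick with the pipe (%>%) idiom throughout and, if needed, append a column containing the extracted text back to the original data frame, or keep it as a vector. The approach below also makes the code more readable.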
library(rvest)
library(dplyr)
dog <- data.frame(url=c("https://en.wikipedia.org/wiki/Dog",
"https://en.wikipedia.org/wiki/Dingo",
"https://en.wikipedia.org/wiki/Canis_lupus_dingo"))
# this keeps the code clean and readable and testable
extract <- function(x, css) {
  # this catches retrieval errors
  pg <- try(read_html(x), silent=TRUE)
  # if any retrieval error, return NA
  if (inherits(pg, "try-error")) { return(NA) }
  pg %>%
    html_nodes(css) %>%
    html_text -> element
  # if there is no matching element the result will be a zero-length character
  # vector, which would prevent sapply from simplifying its output, so test for that here
  element <- ifelse(length(element) == 0, NA, element)
  element
}
# add as a column to the original data frame
dog %>% mutate(text=sapply(as.character(url), extract, "p:nth-child(4)")) -> dog
glimpse(dog)
## Observations: 3
## Variables:
## $ url (fctr) https://en.wikipedia.org/wiki/Dog, https://en.wikipedia....
## $ text (chr) "The domestic dog (Canis lupus familiaris or Canis famili...
# or just get it out as a separate vector
dog$url %>%
as.character %>%
sapply(extract, "p:nth-child(4)")
## https://en.wikipedia.org/wiki/Dog
## "The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated canid which has been selectively bred for millennia for various behaviors, sensory capabilities, and physical attributes.[2] The global dog population is estimated to between 700 million[3] to over one billion, thus making the dog the most abundant member of order Carnivora.[4]"
## https://en.wikipedia.org/wiki/Dingo
## "The dingo's habitat ranges from deserts to grasslands and the edges of forests. Dingoes will normally make their dens in deserted rabbit holes and hollow logs close to an essential supply of water."
## https://en.wikipedia.org/wiki/Canis_lupus_dingo
## NA
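A possible variation on the helper above, sketched under a few assumptions: it uses `tryCatch()` instead of `try()`, returns `NA_character_` so the result type is fixed, and uses `vapply()` so the output is always a character vector. The name `extract_first` and the inline HTML inputs are hypothetical; the inline strings stand in for URLs so the sketch runs without network access.

```r
library(rvest)

# extract_first() is a hypothetical helper name, not part of rvest.
extract_first <- function(x, css) {
  # Return NULL instead of raising if the document cannot be read.
  pg <- tryCatch(read_html(x), error = function(e) NULL)
  if (is.null(pg)) return(NA_character_)

  element <- pg %>% html_nodes(css) %>% html_text()

  # No match gives a zero-length vector; map it to NA so every
  # input yields exactly one string.
  if (length(element) == 0) NA_character_ else element[1]
}

# Inline HTML strings (and one nonexistent file) stand in for URLs.
inputs <- c(
  hit    = "<div><p>a</p><p>b</p><p>c</p><p>fourth paragraph</p></div>",
  miss   = "<div><p>only one paragraph</p></div>",
  broken = "no-such-file.html"
)

# vapply() enforces that each call returns a single character value.
texts <- vapply(inputs, extract_first, character(1), css = "p:nth-child(4)")
texts
```

With real data the same call would take `as.character(dog$url)` as its first argument; a failed download and a page without a match both come back as `NA`.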