R - pass vector to custom function to dplyr::mutate
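I have the following function, which lets me scrape the content of a Wikipedia page from its URL (the exact content is irrelevant to this question):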
getPageContent <- function(url) {
  library(rvest)
  library(magrittr)
  pc <- html(url) %>%
    html_node("#mw-content-text") %>%
    # strip tags
    html_text() %>%
    # concatenate vector of texts into one string
    paste(collapse = "")
  pc
}
This works when I call the function on a single URL:
getPageContent("https://en.wikipedia.org/wiki/Balance_(game_design)")
[1] "In game design, balance is the concept and the practice of tuning a game's rules, usually with the goal of preventing any of its component systems from being ineffective or otherwise undesirable when compared to their peers. An unbalanced system represents wasted development resources at the very least, and at worst can undermine the game's entire ruleset by making impo (...)
However, if I pass the function to dplyr to fetch the content of several pages, I get an error:
example <- data.frame(url = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
                              "https://en.wikipedia.org/wiki/Koncerthuset",
                              "https://en.wikipedia.org/wiki/Tifama_chera",
                              "https://en.wikipedia.org/wiki/Difference_theory"),
                      stringsAsFactors = FALSE
)
library(dplyr)
example <- mutate(example, content = getPageContent(url))
Error: length(url) == 1 is not TRUE
In addition: Warning message:
In mutate_impl(.data, dots) :
the condition has length > 1 and only the first element will be used
Judging by the error, I think the problem is that getPageContent cannot handle a vector of URLs, but I don't know how to fix it.
++++
Edit: both suggested solutions, 1) using rowwise() and 2) using sapply(), work well. In a test with 10 random WP articles, the second approach was about 25% faster:
> system.time(
+ example <- example %>%
+ rowwise() %>%
+ mutate(content = getPageContent(url))
+ )
   user  system elapsed
   0.39    0.14    1.21
>
>
> system.time(
+ example$content <- unlist(lapply(example$url, getPageContent))
+ )
   user  system elapsed
   0.49    0.11    0.90
Why not use lapply() on the vector of URLs, rather than trying to pass a vector of strings to a function that expects a single string:
urls = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
         "https://en.wikipedia.org/wiki/Koncerthuset",
         "https://en.wikipedia.org/wiki/Tifama_chera",
         "https://en.wikipedia.org/wiki/Difference_theory")
Then:
content <- lapply(urls, getPageContent)
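...which returns a list. Alternatively, if your URLs are already in a data frame and you want to add the content as a new column, use sapply(), which returns a vector rather than a list: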
example$contents <- sapply(example$url, getPageContent)
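The difference between calling the function on the whole vector and applying it element by element can be checked without any network access. The following is only a minimal sketch: `fetch_one` is a hypothetical stand-in for getPageContent that insists on a single string, the example.org URLs are placeholders, and `vapply()` is a stricter base R alternative that the answer above does not mention.

```r
# Hypothetical stand-in for getPageContent(): accepts exactly one string.
fetch_one <- function(url) {
  stopifnot(length(url) == 1)
  paste("content of", url)
}

# Placeholder URLs; nothing is downloaded.
urls <- c("https://example.org/a", "https://example.org/b")

# fetch_one(urls) would fail the length check above.
# sapply() calls the function once per element and simplifies to a vector
# (named by the input strings).
via_sapply <- sapply(urls, fetch_one)

# vapply() does the same but verifies that every result is a single string;
# USE.NAMES = FALSE drops the names that sapply() attaches.
via_vapply <- vapply(urls, fetch_one, character(1), USE.NAMES = FALSE)
```

If the named result of sapply() is a nuisance in a data frame column, `unname()` or the `vapply()` form above should avoid it.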
You can use rowwise(), and it will work:
res <- example %>%
  rowwise() %>%
  mutate(content = getPageContent(url))
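To try the rowwise() approach offline, here is a small sketch under stated assumptions: `fetch_one` is a hypothetical stand-in for getPageContent that rejects anything but a single string, and the example.org URLs are placeholders. The trailing `ungroup()` is an addition that is not in the answer above; it removes the row-wise grouping so later verbs operate on whole columns again.

```r
library(dplyr)

# Hypothetical stand-in for getPageContent(): accepts exactly one string.
fetch_one <- function(url) {
  stopifnot(length(url) == 1)
  paste("content of", url)
}

# Placeholder URLs; nothing is downloaded.
example <- data.frame(url = c("https://example.org/a", "https://example.org/b"),
                      stringsAsFactors = FALSE)

res <- example %>%
  rowwise() %>%                        # evaluate the next verb one row at a time
  mutate(content = fetch_one(url)) %>% # url is a single string inside each row
  ungroup()                            # drop the row-wise grouping afterwards
```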