Functional programming principles to perform multiple statistical computations
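I would like to apply several statistical computations, including reliability measures such as the ICC or the coefficient of variation. I can compute each of them individually, but I'm not yet familiar with R's functional programming practices for running multiple computations directly without much code duplication.
Consider the following example data.frame, which contains repeated measurements (T1, T2) of five different variables (Var1, ..., Var5):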
set.seed(123)
df = data.frame(matrix(rnorm(100), nrow=10))
names(df) <- c("T1.Var1", "T1.Var2", "T1.Var3", "T1.Var4", "T1.Var5",
"T2.Var1", "T2.Var2", "T2.Var3", "T2.Var4", "T2.Var5")
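If I want to compute the intraclass correlation coefficient between the two repeated measurements of each variable, I can: 1) create a function that returns the ICC together with its lower and upper bounds: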
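library(psych)  # provides ICC()

calcula_ICC <- function(a, b) {
  ICc <- ICC(matrix(c(a, b), ncol = 2))
  # Row 3 of $results is ICC3; columns 2, 7 and 8 hold the
  # estimate and its lower and upper confidence bounds
  icc <- ICc$results[[2]][3]
  lo <- ICc$results[[7]][3]
  up <- ICc$results[[8]][3]
  round(c(icc, lo, up), 2)
}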
and 2) apply it to each corresponding pair of variables, like this:
calcula_ICC(df$T1.Var1, df$T2.Var1)
calcula_ICC(df$T1.Var2, df$T2.Var2)
calcula_ICC(df$T1.Var3, df$T2.Var3)
calcula_ICC(df$T1.Var4, df$T2.Var4)
calcula_ICC(df$T1.Var5, df$T2.Var5)
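I would then do the same for other statistics on each variable, such as the coefficient of variation or the standard error between the repeated measurements.
But how could I use some functional programming principles here? For example, how could I create a function that takes each pair of corresponding variables at T1 and T2, plus the desired function, as arguments?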
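There are many ways to solve this, and I don't have time to post them all, but I may come back and add an lapply solution, since the apply family of functions is so important in R.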
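For reference, an lapply version could look roughly like the sketch below. It is only an illustration, not the solution promised above: it assumes the "T1.VarN"/"T2.VarN" column naming from the question, and it uses cor() and a mean difference as base-R stand-ins so that it runs without psych. Swapping in calcula_ICC(t1, t2) inside the anonymous function should work the same way.

```r
set.seed(123)
df <- data.frame(matrix(rnorm(100), nrow = 10))
names(df) <- c(paste0("T1.Var", 1:5), paste0("T2.Var", 1:5))

vars <- paste0("Var", 1:5)

# One list element per variable; each element is a small named vector
results <- lapply(vars, function(v) {
  t1 <- df[[paste0("T1.", v)]]
  t2 <- df[[paste0("T2.", v)]]
  # Stand-in statistics (assumption): replace with calcula_ICC(t1, t2)
  c(mean_diff = mean(t2) - mean(t1), cor = cor(t1, t2))
})
names(results) <- vars

# Bind the per-variable vectors into one matrix, one row per variable
tab <- do.call(rbind, results)
```

Because every element has the same names, do.call(rbind, ...) gives a matrix with one row per variable and one column per statistic.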
Using dplyr and tidyr
Here's a solution with dplyr and tidyr that might help:
library(dplyr)
library(tidyr)
library(psych)

# let's have a function for each value you want eventually
GetICC <- function(x, y) {
  ICC(matrix(c(x, y), ncol = 2))$results[[2]][3]
}
GetICCLo <- function(x, y) {
  ICC(matrix(c(x, y), ncol = 2))$results[[7]][3]
}
GetICCUp <- function(x, y) {
  ICC(matrix(c(x, y), ncol = 2))$results[[8]][3]
}

# tidy up your data, take a look at what this looks like
mydata <- df %>%
  mutate(id = row_number()) %>%
  gather(key = time, value = value, -id) %>%
  separate(time, c("Time", "Var")) %>%
  spread(key = Time, value = value)

# group by variable, then run your functions
# notice I added mean difference between the two
# times as an example of how you can extend this
# to include whatever summaries you need
myresults <- mydata %>%
  group_by(Var) %>%
  summarize(icc = GetICC(T1, T2),
            icc_lo = GetICCLo(T1, T2),
            icc_up = GetICCUp(T1, T2),
            mean_diff = mean(T2) - mean(T1))
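This works well as long as everything you pass to summarize is aggregated/calculated at the same level, that is, each expression returns one value per group.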
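As a side note, the same reshaping can be sketched with pivot_longer() and pivot_wider(), which tidyr (1.0.0 or later) offers as the successors to gather() and spread(). This is an alternative under that version assumption, not the code from the answer; the summaries shown (mean_diff, sd_diff) are illustrative and each one reduces a group to a single value, which is the constraint mentioned above.

```r
library(dplyr)
library(tidyr)

set.seed(123)
df <- data.frame(matrix(rnorm(100), nrow = 10))
names(df) <- c(paste0("T1.Var", 1:5), paste0("T2.Var", 1:5))

# Reshape to one row per (id, Var), with T1 and T2 as columns
long <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = c("Time", "Var"), names_sep = "\\.") %>%
  pivot_wider(names_from = Time, values_from = value)

# Every expression inside summarise() returns one value per group
extra <- long %>%
  group_by(Var) %>%
  summarise(mean_diff = mean(T2) - mean(T1),
            sd_diff   = sd(T2 - T1))
```

The ICC helpers from the answer can be added to the summarise() call unchanged, since they also return a single number per group.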
The functional programming approach is to use mapply. No "tidying" is needed:
result = mapply(calcula_ICC, df[, 1:5], df[, 6:10], USE.NAMES=FALSE)
colnames(result) = paste0('Var', 1:5)
# Better than setting rownames here is to have calcula_ICC() return a named vector
rownames(result) = c('icc','lo','up')
> result
# Var1 Var2 Var3 Var4 Var5
# icc 0.09 0.08 -0.37 -0.23 -0.17
# lo -0.54 -0.55 -0.80 -0.73 -0.70
# up 0.66 0.65 0.29 0.43 0.48
(Note that the result is a matrix.)
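To take this one step further and pass the statistic itself as an argument, as the question asks, the mapply call can be wrapped in a small helper. The sketch below is one possible way to do it: apply_paired() and typical_error() are hypothetical names introduced here, the helper assumes columns named "<time>.<variable>", and SD of the differences divided by sqrt(2) is just one common definition of a standard error of measurement.

```r
set.seed(123)
df <- data.frame(matrix(rnorm(100), nrow = 10))
names(df) <- c(paste0("T1.Var", 1:5), paste0("T2.Var", 1:5))

# Hypothetical helper: call f(t1_column, t2_column) for every variable.
# Assumes column names of the form "<time>.<variable>".
apply_paired <- function(data, f, times = c("T1", "T2")) {
  first  <- data[startsWith(names(data), paste0(times[1], "."))]
  second <- data[startsWith(names(data), paste0(times[2], "."))]
  vars <- sub("^[^.]+\\.", "", names(first))
  # Both time points must list the same variables in the same order
  stopifnot(identical(vars, sub("^[^.]+\\.", "", names(second))))
  out <- mapply(f, first, second)
  # mapply returns a matrix when f returns more than one value
  if (is.matrix(out)) colnames(out) <- vars else names(out) <- vars
  out
}

# Illustrative statistics (assumptions, not from the original posts)
mean_diff     <- function(a, b) mean(b) - mean(a)
typical_error <- function(a, b) sd(b - a) / sqrt(2)

md <- apply_paired(df, mean_diff)
te <- apply_paired(df, typical_error)
both <- apply_paired(df, function(a, b) c(md = mean_diff(a, b),
                                          te = typical_error(a, b)))
```

With psych loaded, apply_paired(df, calcula_ICC) should give the same kind of matrix as the mapply call above, with one column per variable.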