Using dplyr to calculate quantile from multiple columns
set.seed(123)
For a vector, if I want the median and the lower and upper bounds of the 95% interval, I can do this:
x <- rnorm(20)
quantile(x, probs = 0.500) # median
quantile(x, probs = 0.025) # lower bound
quantile(x, probs = 0.975) # upper bound
I have a data frame:
df <- data.frame(loc = rep(1:2, each = 4),
                 year = rep(1980:1983, times = 2),
                 x1 = rnorm(8), x2 = rnorm(8), x3 = rnorm(8), x4 = rnorm(8),
                 x5 = rnorm(8), x6 = rnorm(8), x7 = rnorm(8), x8 = rnorm(8))
For each location and year, I want to find the median and the lower and upper bounds using x1 to x8.
df %>% group_by(loc, year) %>%
  dplyr::summarise(mean.x = quantile(x1, x2, x3, x4, x5, x6, x7, x8, probs = 0.500),
                   lower.x = quantile(x1, x2, x3, x4, x5, x6, x7, x8, probs = 0.025),
                   upper.x = quantile(x1, x2, x3, x4, x5, x6, x7, x8, probs = 0.975))
But this gives me the same value in all three columns.
# A tibble: 8 x 5
# Groups:   loc [?]
    loc  year mean.x lower.x upper.x
  <int> <int>  <dbl>   <dbl>   <dbl>
1     1  1980 -1.07   -1.07   -1.07
2     1  1981 -0.218  -0.218  -0.218
3     1  1982 -1.03   -1.03   -1.03
4     1  1983 -0.729  -0.729  -0.729
5     2  1980 -0.625  -0.625  -0.625
6     2  1981 -1.69   -1.69   -1.69
7     2  1982  0.838   0.838   0.838
8     2  1983  0.153   0.153   0.153
Also, is there a way to refer to the columns by index, with something like 3:ncol(df), instead of naming x1, x2, ..., x8?
You may want to convert the data from wide to long format first:
library(dplyr)
library(tidyr)
df %>% gather(xvar, value, x1:x8) %>%
  group_by(loc, year) %>%
  summarise(mean.x = quantile(value, probs = 0.50),
            lower.x = quantile(value, probs = 0.025),
            upper.x = quantile(value, probs = 0.975))
This gives:
# A tibble: 8 x 5
# Groups:   loc [?]
    loc  year  mean.x lower.x upper.x
  <int> <int>   <dbl>   <dbl>   <dbl>
1     1  1980  0.152   -0.982   2.08
2     1  1981 -0.478   -1.33    0.825
3     1  1982 -0.0415  -1.95    1.02
4     1  1983  0.855   -0.180   1.43
5     2  1980  0.658   -1.24    2.23
6     2  1981  0.196   -0.782   0.827
7     2  1982 -0.629   -0.937   0.285
8     2  1983 -0.0737  -0.744   1.27
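As a possible variation on the wide-to-long approach above, the sketch below selects the value columns by position, which is what the question asks about. It assumes a tidyr version that provides pivot_longer() and a dplyr version that supports the .groups argument; the names res_long and median.x are made up for this illustration and do not appear in the original posts.

```r
library(dplyr)
library(tidyr)

set.seed(123)
df <- data.frame(loc = rep(1:2, each = 4),
                 year = rep(1980:1983, times = 2),
                 x1 = rnorm(8), x2 = rnorm(8), x3 = rnorm(8), x4 = rnorm(8),
                 x5 = rnorm(8), x6 = rnorm(8), x7 = rnorm(8), x8 = rnorm(8))

# Reshape every column from position 3 onward into a single "value" column,
# then compute the quantiles per (loc, year) group.
res_long <- df %>%
  pivot_longer(cols = 3:ncol(df), names_to = "xvar", values_to = "value") %>%
  group_by(loc, year) %>%
  summarise(median.x = quantile(value, probs = 0.5),
            lower.x  = quantile(value, probs = 0.025),
            upper.x  = quantile(value, probs = 0.975),
            .groups = "drop")

print(res_long)
```

Selecting by position is fragile if the column order ever changes, so a name-based selection such as x1:x8 may be the safer choice when the names are known.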
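The function quantile takes only one input vector. When you call
quantile(x1, x2, x3, x4, x5, x6 , x7, x8, probs = 0.5)
you are passing it 8 vectors. Only x1 is treated as the data; x2 to x8 are not pooled with it. They are matched to quantile's other arguments instead, which is where the warning in the example below comes from.
Example: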
x <- rnorm(20)
y = rnorm(20) + 100
quantile(x, probs = 0.025) # lower
# 2.5%
# -1.633378
quantile(x, y, probs = 0.025) # y will be ignored. This yields same result as quantile(x, probs = 0.025). A warning explains this
# 2.5%
# -1.633378
# Warning message:
# In if (na.rm) x <- x[!is.na(x)] else if (anyNA(x)) stop("missing values and NaN's not allowed if 'na.rm' is FALSE") :
# the condition has length > 1 and only the first element will be used
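To make the contrast concrete, here is a small sketch in base R that compares the quantile of x alone with the quantile of x and y pooled through c(). The seed and the names q_x and q_pooled are chosen only for this illustration. Note also that from R 4.2.0 onward a condition of length greater than one is an error rather than a warning, so the quantile(x, y, ...) call above may stop instead of returning a value on newer installations.

```r
set.seed(1)  # arbitrary seed, used only for this illustration
x <- rnorm(20)
y <- rnorm(20) + 100

# Quantile of x alone
q_x <- quantile(x, probs = 0.975)

# Quantile of x and y pooled into one vector of length 40
q_pooled <- quantile(c(x, y), probs = 0.975)

print(q_x)
print(q_pooled)
```

Because y is shifted by 100, the pooled upper quantile lands among the y values, which shows that c() really does feed both vectors to quantile as data.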
To solve your specific problem, put x1 to x8 inside c() so that they form a single vector:
df %>% group_by(loc, year) %>%
  dplyr::summarise(lower.x = quantile(c(x1, x2, x3, x4, x5, x6, x7, x8), probs = 0.025),
                   mean.x = quantile(c(x1, x2, x3, x4, x5, x6, x7, x8), probs = 0.5),
                   upper.x = quantile(c(x1, x2, x3, x4, x5, x6, x7, x8), probs = 0.975))
This yields:
# A tibble: 8 x 5
# Groups:   loc [?]
    loc  year     lower.x     mean.x   upper.x
  <int> <int>       <dbl>      <dbl>     <dbl>
1     1  1980 -1.12583212  0.1683845 1.1579655
2     1  1981 -1.20363611 -0.1399433 1.9308253
3     1  1982 -0.93238412 -0.3195850 0.3835611
4     1  1983 -2.08331501 -0.4235632 1.2267823
5     2  1980 -1.46528453 -0.3096375 0.9863813
6     2  1981 -1.51563211  0.1100798 0.8267675
7     2  1982 -1.16435350  0.1885864 0.8349510
8     2  1983 -0.01427533  0.4301591 1.9688637
By the way, the upper bound should be 0.975; you had typed 0.0975.
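On the question of referring to the columns by index: in this particular data frame every (loc, year) pair occupies exactly one row, so one possible shortcut is to compute the quantiles row by row with base R's apply(). This is only a sketch under that one-row-per-group assumption, and the names x_cols, probs, q and res_wide are made up for the illustration.

```r
set.seed(123)
df <- data.frame(loc = rep(1:2, each = 4),
                 year = rep(1980:1983, times = 2),
                 x1 = rnorm(8), x2 = rnorm(8), x3 = rnorm(8), x4 = rnorm(8),
                 x5 = rnorm(8), x6 = rnorm(8), x7 = rnorm(8), x8 = rnorm(8))

x_cols <- 3:ncol(df)  # value columns, selected by position
probs  <- c(lower.x = 0.025, median.x = 0.5, upper.x = 0.975)

# apply() over rows returns a 3 x 8 matrix (one column per row of df),
# so transpose it to get one row per (loc, year).
q <- t(apply(df[, x_cols], 1, quantile, probs = probs, names = FALSE))
colnames(q) <- names(probs)

res_wide <- cbind(df[, c("loc", "year")], q)
print(res_wide)
```

If a (loc, year) pair could span several rows, this row-wise version would no longer pool them, and the grouped approaches shown earlier would be the ones to use.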