R: Assigning Data to Their Percentiles
I am using the R programming language. Suppose I have the following data frame:
var_1 = rnorm(100,10,10)
var_2 = rnorm(100,10,10)
var_3 = rnorm(100,10,10)
d = data.frame(var_1, var_2, var_3)
head(d)
var_1 var_2 var_3
1 14.251923 14.877801 22.636207
2 7.325137 8.513718 21.021522
3 3.400001 -3.400397 11.274797
4 16.400597 8.623980 9.366115
5 7.065583 13.155570 17.891432
6 21.297912 4.341385 -11.337330
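My question: for every element of every variable, I want to replace the element with the percentile it belongs to (e.g. 5th, 10th, 15th, and so on).
For example: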
a = quantile(d$var_1, c(0.05, 0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1))
b = quantile(d$var_2, c(0.05, 0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1))
c = quantile(d$var_3, c(0.05, 0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1))
new = data.frame(a,b,c)
a b c
5% -0.8806901 -7.40560488 -4.7353920
10% 0.3595086 -3.77910527 -0.6874766
15% 1.1201300 -2.91946322 0.9584040
20% 3.0581928 0.05127097 2.1457693
25% 5.0901641 1.91719913 4.6997966
30% 7.0056228 2.56215345 6.2691894
35% 7.6089831 3.58688942 7.1900823
40% 8.9853805 5.00957881 7.8488446
45% 9.9264540 5.73653135 8.6135093
50% 10.2235212 7.43425669 9.6063344
55% 11.5707533 8.54160196 10.9239040
60% 13.2422940 9.65006232 11.7036647
65% 15.1076889 11.07081528 13.2440004
70% 16.5354881 12.38804922 15.2585324
75% 17.9336020 13.16121940 17.6656208
80% 19.5312682 15.31472178 18.4820207
85% 21.9264905 17.99689941 19.3347983
90% 24.4511364 20.47478783 22.0647173
95% 26.6820271 25.27082341 24.4473033
100% 41.4419744 39.75848302 34.5105183
Now, whenever a value falls within one of these percentile ranges, I want to make the following replacements:
- If d$var_1 < -0.8806901, then d$var_1 == as.factor("5th percentile")
- If d$var_1 > -0.8806901 & d$var_1 < 0.3595086, then d$var_1 == as.factor("10th percentile")
...
- If d$var_1 > 15.1076889 & d$var_1 < 16.5354881, then d$var_1 == as.factor("70th percentile")
etc.
- If d$var_2 < -7.40560488, then d$var_2 == as.factor("5th percentile")
etc.
- If d$var_3 < -4.7353920, then d$var_3 == as.factor("5th percentile")
etc.
Can someone please show me how to do this?
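This may be what you want (ntile() comes from the dplyr package, so it has to be loaded first):
library(dplyr)
# apply() returns a character matrix with one column per variable
apply(d, 2, function(x) paste0(ntile(x, n = 20L) / 20 * 100, "th percentile"))
Output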
var_1 var_2 var_3
[1,] "60th percentile" "100th percentile" "25th percentile"
[2,] "80th percentile" "60th percentile" "100th percentile"
[3,] "45th percentile" "90th percentile" "75th percentile"
[4,] "70th percentile" "85th percentile" "35th percentile"
[5,] "30th percentile" "5th percentile" "55th percentile"
...
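Addendum: to replace only selected columns in place with data.table:
library(data.table)
library(dplyr)  # for ntile()
cols = c("var_1", "var_3")
# setDT() converts d by reference; := overwrites only the columns named in cols
setDT(d)[, (cols) := lapply(.SD, function(x) paste0(ntile(x, n = 20L) / 20 * 100, "th percentile")), .SDcols = cols]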
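ntile() assigns buckets by rank. If you would rather bin directly on the quantile() cutoffs shown in the question, a base R version could look roughly like the sketch below. The helper name to_percentile, the step argument, and the fixed seed are assumptions made for illustration; they are not part of any package or of the original post.

```r
# Minimal sketch: label each value with the upper percentile of the
# 5%-wide bin it falls into, using cutoffs computed by quantile().
set.seed(1)  # assumption: fixed seed so the example is reproducible
d <- data.frame(
  var_1 = rnorm(100, 10, 10),
  var_2 = rnorm(100, 10, 10),
  var_3 = rnorm(100, 10, 10)
)

# Hypothetical helper, not taken from any package
to_percentile <- function(x, step = 0.05) {
  probs  <- seq(0, 1, by = step)                # 0%, 5%, ..., 100%
  breaks <- quantile(x, probs, names = FALSE)   # bin edges
  labels <- paste0(round(probs[-1] * 100), "th percentile")
  # include.lowest = TRUE keeps the minimum value inside the first bin
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# d_pct[] <- ... keeps the data frame structure; each column becomes a factor
d_pct <- d
d_pct[] <- lapply(d, to_percentile)
head(d_pct)
```

Note that cut() stops with an error if the breaks are not unique, so data with many tied values would need extra handling (for example, dropping duplicated breaks first).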
purrr can also be used here (thanks to @PeaceWang for the function):
library(tidyverse)  # loads purrr and dplyr (for ntile())
output <- purrr::map(d, function(x)
  paste0(ntile(x, n = 20L) / 20 * 100, "th percentile")) %>%
  as.data.frame()
Output
head(output, 10)
var_1 var_2 var_3
1 40th percentile 15th percentile 85th percentile
2 5th percentile 60th percentile 70th percentile
3 65th percentile 60th percentile 65th percentile
4 60th percentile 10th percentile 75th percentile
5 15th percentile 40th percentile 5th percentile
6 10th percentile 35th percentile 85th percentile
7 30th percentile 45th percentile 95th percentile
8 85th percentile 25th percentile 45th percentile
9 75th percentile 90th percentile 80th percentile
10 65th percentile 100th percentile 10th percentile
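The labels above are plain character strings, while the question asks for factors. Converting with as.factor() would sort the levels alphabetically ("100th percentile" before "5th percentile"), so it may help to set the levels explicitly. The following is only a sketch; the example vector labels and the name lv are made up for illustration.

```r
# Made-up labels in the same format as the output above
labels <- c("60th percentile", "5th percentile",
            "100th percentile", "5th percentile")

# Levels listed in percentile order: 5th, 10th, ..., 100th
lv <- paste0(seq(5, 100, by = 5), "th percentile")

# An ordered factor keeps the natural percentile ordering
f <- factor(labels, levels = lv, ordered = TRUE)

levels(factor(labels))  # default levels: alphabetical order
as.integer(f)           # position of each label within lv
f < "60th percentile"   # ordered factors support comparisons
```

To apply this to a whole data frame such as output, something like output[] <- lapply(output, factor, levels = lv, ordered = TRUE) should work.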
Here is an almost one-liner using my santoku package:
library(santoku)
d[] <- apply(d, 2, chop_quantiles, probs = 0:100/100,
             labels = lbl_endpoint(fmt = "%.2f"))
d[] <- apply(d, 2, as.numeric)
The d[] on the left-hand side is a trick to keep d as a data frame.
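To see why the d[] matters, here is a small base R illustration. The toy data frame and the use of round() are assumptions chosen only to keep the example short.

```r
# Small data frame for illustration (values are made up)
d <- data.frame(a = c(1.234, 5.678), b = c(9.876, 5.432))

# apply() on its own returns a matrix, so a plain `d <- apply(...)`
# would replace the data frame with a matrix
m <- apply(d, 2, round, digits = 1)
class(m)

# Assigning into d[] fills the existing columns instead, so d stays
# a data frame and keeps its column names
d[] <- apply(d, 2, round, digits = 1)
class(d)
```

When the columns have different types, d[] <- lapply(d, f) is often the safer variant, because it avoids the intermediate matrix, which forces all columns to a common type.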