R: Assigning Data to their Percentiles

I am working in the R programming language. Suppose I have the following data frame:

var_1 = rnorm(100,10,10)
var_2 = rnorm(100,10,10)
var_3 = rnorm(100,10,10)

d = data.frame(var_1, var_2, var_3)

head(d)


      var_1     var_2      var_3
1 14.251923 14.877801  22.636207
2  7.325137  8.513718  21.021522
3  3.400001 -3.400397  11.274797
4 16.400597  8.623980   9.366115
5  7.065583 13.155570  17.891432
6 21.297912  4.341385 -11.337330

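My question: for every element of every variable, I want to replace the element with the percentile it belongs to (e.g. 5th, 10th, 15th, and so on).

For example: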

a = quantile(d$var_1, c(0.05, 0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1))
b = quantile(d$var_2, c(0.05, 0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1))
c = quantile(d$var_3, c(0.05, 0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1))

new = data.frame(a,b,c)

              a           b          c
5%   -0.8806901 -7.40560488 -4.7353920
10%   0.3595086 -3.77910527 -0.6874766
15%   1.1201300 -2.91946322  0.9584040
20%   3.0581928  0.05127097  2.1457693
25%   5.0901641  1.91719913  4.6997966
30%   7.0056228  2.56215345  6.2691894
35%   7.6089831  3.58688942  7.1900823
40%   8.9853805  5.00957881  7.8488446
45%   9.9264540  5.73653135  8.6135093
50%  10.2235212  7.43425669  9.6063344
55%  11.5707533  8.54160196 10.9239040
60%  13.2422940  9.65006232 11.7036647
65%  15.1076889 11.07081528 13.2440004
70%  16.5354881 12.38804922 15.2585324
75%  17.9336020 13.16121940 17.6656208
80%  19.5312682 15.31472178 18.4820207
85%  21.9264905 17.99689941 19.3347983
90%  24.4511364 20.47478783 22.0647173
95%  26.6820271 25.27082341 24.4473033
100% 41.4419744 39.75848302 34.5105183

Now, whenever a value falls within one of these percentile ranges, I want to make the following replacement:

...

and so on.

Can someone please show me how to do this?

This may be what you are looking for:

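library(dplyr)  # provides ntile()

# ntile() splits each column into 20 rank-based buckets (1..20);
# bucket / 20 * 100 turns the bucket number into 5, 10, ..., 100
apply(d, 2, function(x) paste0( ntile(x, n = 20L) / 20 * 100, "th percentile" ))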

Output

       var_1              var_2              var_3             
  [1,] "60th percentile"  "100th percentile" "25th percentile" 
  [2,] "80th percentile"  "60th percentile"  "100th percentile"
  [3,] "45th percentile"  "90th percentile"  "75th percentile" 
  [4,] "70th percentile"  "85th percentile"  "35th percentile" 
  [5,] "30th percentile"  "5th percentile"   "55th percentile" 
  ...
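
Because the question is framed in terms of ranges between quantile values, the same idea can also be sketched in base R with quantile() and cut(). This is only a sketch: the seed and the helper name to_percentile are assumptions made for illustration, and the result is numeric (5, 10, ..., 100) rather than a label. Unlike ntile(), which buckets by rank, cut() compares each value against the quantile cut points, so the two can differ when the data contain ties.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
d <- data.frame(var_1 = rnorm(100, 10, 10),
                var_2 = rnorm(100, 10, 10),
                var_3 = rnorm(100, 10, 10))

probs <- seq(0, 1, by = 0.05)  # 0%, 5%, ..., 100%

# Hypothetical helper: map each value to the upper percentile of its range
to_percentile <- function(x) {
  breaks <- quantile(x, probs)               # 21 cut points, 20 ranges
  bin <- cut(x, breaks = breaks,
             labels = FALSE,                 # return range index 1..20
             include.lowest = TRUE)          # keep the minimum in range 1
  bin * 5                                    # 1..20 -> 5, 10, ..., 100
}

pct <- as.data.frame(lapply(d, to_percentile))
head(pct)
```

If the data contain many repeated values, quantile() may return duplicate cut points and cut() will then stop with an error, so this version assumes continuous data like the example.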

Addendum: to convert only selected columns, using data.table

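library(data.table)
library(dplyr)  # for ntile()
cols = c("var_1", "var_3")  # columns to convert
# setDT() converts d by reference; := overwrites only the columns named in cols
setDT(d)[, (cols) := lapply(.SD, function(x) paste0( ntile(x, n = 20L) / 20 * 100, "th percentile")), .SDcols = cols]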

purrr can also be used here (thanks to @PeaceWang for the function):

library(tidyverse)

output <- purrr::map(d, function(x)
  paste0(ntile(x, n = 20L) / 20 * 100, "th percentile")) %>%
  as.data.frame()

Output

head(output, 10)

               var_1            var_2            var_3
1    40th percentile  15th percentile  85th percentile
2     5th percentile  60th percentile  70th percentile
3    65th percentile  60th percentile  65th percentile
4    60th percentile  10th percentile  75th percentile
5    15th percentile  40th percentile   5th percentile
6    10th percentile  35th percentile  85th percentile
7    30th percentile  45th percentile  95th percentile
8    85th percentile  25th percentile  45th percentile
9    75th percentile  90th percentile  80th percentile
10   65th percentile 100th percentile  10th percentile
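
For readers who prefer to stay inside a dplyr pipeline, the same transformation can probably be written with mutate() and across(). This is a sketch under assumptions: it needs dplyr 1.0.0 or later for across(), and the seed and the name labelled are illustrative only.

```r
library(dplyr)

set.seed(1)  # assumed seed, only to make the sketch reproducible
d <- data.frame(var_1 = rnorm(100, 10, 10),
                var_2 = rnorm(100, 10, 10),
                var_3 = rnorm(100, 10, 10))

# Apply the same labelling to every column; the result is still a data frame
labelled <- d %>%
  mutate(across(everything(),
                ~ paste0(ntile(.x, 20L) * 5, "th percentile")))

head(labelled)
```

Replacing everything() with a selection such as c(var_1, var_3) would restrict the change to those columns.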

Here is an almost one-liner using my santoku package:

library(santoku)
d[] <- apply(d, 2, chop_quantiles, probs = 0:100/100, 
               labels = lbl_endpoint(fmt = "%.2f"))
d[] <- apply(d, 2, as.numeric)

The d[] on the left-hand side is a trick that keeps d as a data frame.
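
The effect of d[] can be seen in a small base R sketch. The toy data frame and the names m and d2 are made up for illustration: apply() returns a matrix, and assigning that matrix into d2[] fills the existing columns instead of replacing the whole object.

```r
# Toy data, assumed only for this illustration
d2 <- data.frame(a = c(1.26, 2.71), b = c(3.14, 4.58))

m <- apply(d2, 2, round)   # apply() returns a matrix, not a data frame
is.matrix(m)               # TRUE

d2[] <- apply(d2, 2, round)  # fill the columns in place
is.data.frame(d2)            # TRUE: d2 keeps its class and column names
```

Writing d2 <- apply(d2, 2, round) instead would overwrite d2 with the matrix.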