R - cut2 versus quantile function

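Can anyone tell me the difference between the quantile function in R and the cut2 function in the Hmisc package?

I know that quantile offers 9 different methods (types) for computing quantiles. However, when I use cut2(mydata, g = 4), the resulting quartile cut points do not match the output of quantile under any of those types.

Any help is much appreciated.

Thanks in advance.

From the cut2 help file: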

Function like cut but left endpoints are inclusive and labels are of the form [lower, upper), except that last interval is [lower,upper]. If cuts are given, will by default make sure that cuts include entire range of x.

So cut2 is basically cut with some different defaults. Let's look at cut, then.
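As a side note, the interval convention quoted from the cut2 help can be mimicked in base R. The sketch below uses only cut (not cut2 itself); the break points and the variable names are chosen here purely for illustration.

```r
x <- 0:100
breaks <- c(0, 25, 50, 75, 100)  # illustrative break points, not from the question

# Default cut: intervals are closed on the right, (lower, upper]
right_closed <- cut(x, breaks = breaks, include.lowest = TRUE)

# right = FALSE gives [lower, upper); include.lowest = TRUE then closes
# the last interval, so it becomes [lower, upper]
left_closed <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)

levels(right_closed)
levels(left_closed)
```

With right = FALSE, a value that sits exactly on an interior break point, such as 25, goes to the interval that starts there rather than the one that ends there.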

From the cut help file:

cut divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.

From the quantile help file:

The generic function quantile produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

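One cuts the range of x; the other cuts the "frequency" of x, that is, it splits the observations by how many fall below each point.

An illustration: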

out <- 0:100
out2 <- c(seq(0, 50, 0.001), 51:100)

Both have the same range, from 0 to 100.

levels(cut(out, 4, include.lowest = TRUE))
[1] "[-0.1,25]" "(25,50]"   "(50,75]"   "(75,100]" 
levels(cut(out2, 4, include.lowest = TRUE))
[1] "[-0.1,25]" "(25,50]"   "(50,75]"   "(75,100]" 

But out2 contains many more data points, specifically values between 0 and 50. The frequencies within those ranges therefore differ:

quantile(out)
  0%  25%  50%  75% 100% 
   0   25   50   75  100 
quantile(out2)
      0%      25%      50%      75%     100% 
  0.0000  12.5125  25.0250  37.5375 100.0000 

This is the difference between cut and quantile.

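The example above also shows a case where the two agree, namely evenly distributed data. The sequence 0 to 100, for instance, is spread evenly over the range 0 to 100, so here the two give essentially the same break points.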
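That agreement can be checked directly. The following is a minimal sketch that compares equal-width break points with the default quantile output for both vectors; width_breaks and the agree_* names are introduced here only for illustration, and the small extension that cut applies to the outermost limits is ignored.

```r
out  <- 0:100
out2 <- c(seq(0, 50, 0.001), 51:100)

# Equal-width break points over the shared range 0..100
# (roughly what cut(x, 4) uses, apart from its slight widening of the outer limits)
width_breaks <- seq(min(out), max(out), length.out = 5)

# Equal-frequency break points from quantile (default type 7)
q_out  <- unname(quantile(out))
q_out2 <- unname(quantile(out2))

agree_even   <- isTRUE(all.equal(q_out,  width_breaks))  # evenly spread data
agree_skewed <- isTRUE(all.equal(q_out2, width_breaks))  # data concentrated in 0..50

agree_even
agree_skewed
```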

To illustrate further, consider:

outdf <- data.frame(out = out, cut = cut(out, 4, include.lowest = TRUE))
out2df <- data.frame(out = out2, cut = cut(out2, 4, include.lowest = TRUE))

table(outdf$cut)
[-0.1,25]   (25,50]   (50,75]  (75,100] 
       26        25        25        25 
table(out2df$cut)
[-0.1,25]   (25,50]   (50,75]  (75,100] 
    25001     25000        25        25 

Here you can clearly see the different frequencies in each bin: cut produces bins of equal width, not bins with equal counts.
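For contrast, one way to get bins with roughly equal counts in base R is to pass the quantile output to cut as explicit break points. This is only a sketch using the default quantile type; it is not a claim about how cut2 picks its cut points internally, and q_breaks, q_bins and counts are names made up for the example.

```r
out2 <- c(seq(0, 50, 0.001), 51:100)

# Quartile break points (default quantile type 7)
q_breaks <- quantile(out2, probs = seq(0, 1, by = 0.25))

# Use them as explicit breaks; include.lowest = TRUE keeps the minimum value
q_bins <- cut(out2, breaks = q_breaks, include.lowest = TRUE)

counts <- table(q_bins)
counts
```

The bins now differ widely in width (the last one spans from about 37.5 to 100), but each holds about a quarter of the observations.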