R - cut2 versus quantile function

Question

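Can anyone tell me the difference between the quantile function in R and the cut2 function from the Hmisc package?

I know that quantile offers 9 different methods (types) for computing quartiles. However, when I call cut2(mydata, g = 4), the resulting quartiles do not correspond to the output of any of the quantile types.

Any help is much appreciated.

Thanks in advance.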

Answer 1

From the cut2 help file:

Function like cut but left endpoints are inclusive and labels are of the form [lower, upper), except that last interval is [lower,upper]. If cuts are given, will by default make sure that cuts include entire range of x.

So cut2 is basically cut with some different defaults. Let's look at cut next.

From the cut help file:

cut divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.

From the quantile help file:

The generic function quantile produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

One divides the range of x; the other divides the "frequency" of x, that is, how the observations are distributed.

Illustration:

out <- 0:100
out2 <- c(seq(0, 50, 0.001), 51:100)

Both have the same range: from 0 to 100.

levels(cut(out, 4, include.lowest = TRUE))
[1] "[-0.1,25]" "(25,50]"   "(50,75]"   "(75,100]"
levels(cut(out2, 4, include.lowest = TRUE))
[1] "[-0.1,25]" "(25,50]"   "(50,75]"   "(75,100]"

But out2 contains many more data points, in particular values between 0 and 50. As a result, the two vectors have different frequencies across that range:

quantile(out)
  0%  25%  50%  75% 100% 
   0   25   50   75  100 
quantile(out2)
      0%      25%      50%      75%     100% 
  0.0000  12.5125  25.0250  37.5375 100.0000

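That is the difference between cut and quantile.

The example above also shows a case where the two agree, namely when the data are uniformly distributed. The sequence 0 to 100, for instance, is spread evenly over the range 0 to 100, so here the two give essentially the same breakpoints.

To illustrate further, consider: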
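One way to check this agreement is to compute equal-width breakpoints by hand and compare them with the quartiles. This is only a rough sketch: `equal_width_breaks` is a helper name introduced here for illustration, and it ignores the small outward adjustment that cut applies to the outermost breaks.

```r
# Same vectors as in the illustration
out  <- 0:100
out2 <- c(seq(0, 50, 0.001), 51:100)

# Hypothetical helper: k equal-width intervals over the range of x
# (cut() also widens the outermost breaks slightly; ignored here)
equal_width_breaks <- function(x, k = 4) {
  seq(min(x), max(x), length.out = k + 1)
}

# Evenly spread data: equal-width breaks and quartiles coincide
agree_out <- isTRUE(all.equal(equal_width_breaks(out),
                              unname(quantile(out))))

# Data concentrated between 0 and 50: they no longer coincide
agree_out2 <- isTRUE(all.equal(equal_width_breaks(out2),
                               unname(quantile(out2))))

print(c(out = agree_out, out2 = agree_out2))
```

The comparison uses the default quantile type; other types may shift the quartiles slightly.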

outdf <- data.frame(out = out, cut = cut(out, 4, include.lowest = TRUE))
out2df <- data.frame(out = out2, cut = cut(out2, 4, include.lowest = TRUE))

table(outdf$cut)
[-0.1,25]   (25,50]   (50,75]  (75,100] 
       26        25        25        25 
table(out2df$cut)
[-0.1,25]   (25,50]   (50,75]  (75,100] 
    25001     25000        25        25

Here you can clearly see the different frequencies in each bin.
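If roughly equal counts per bin are wanted instead, one option is to pass the quantiles to cut as explicit breaks. The sketch below assumes the default quantile type and that the resulting breakpoints are distinct; with heavily tied data, duplicated quantiles would make cut stop with an error. The names `q_breaks`, `q_bins` and `counts` are chosen here for illustration.

```r
# Same skewed vector as in the illustration
out2 <- c(seq(0, 50, 0.001), 51:100)

# Use the quartiles as explicit breakpoints (default quantile type)
q_breaks <- quantile(out2, probs = seq(0, 1, 0.25))

# include.lowest = TRUE keeps the minimum value in the first bin
q_bins <- cut(out2, breaks = q_breaks, include.lowest = TRUE)

# Counts per bin should now be close to one quarter of the data each
counts <- table(q_bins)
print(counts)
```

Binning this way follows the frequency of the data rather than its range, which is why the bin widths come out unequal.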
