R - cut2 versus quantile function
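Can anyone tell me the difference between the quantile function in R and the cut2 function from the Hmisc package?
I know that quantile offers 9 different methods (types) for computing quartiles. However, when I call cut2(mydata, g = 4), the resulting quartiles do not correspond to the output of any of the quantile types.
Any help is much appreciated.
Thanks in advance.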
From the cut2 help file:
Function like cut but left endpoints are inclusive and labels are of
the form [lower, upper), except that last interval is [lower,upper].
If cuts are given, will by default make sure that cuts include entire
range of x.
So cut2 is basically cut with some different defaults. Let's look at cut next.
From the cut help file:
cut divides the range of x into intervals and codes the values in x
according to which interval they fall. The leftmost interval
corresponds to level one, the next leftmost to level two and so on.
From the quantile help file:
The generic function quantile produces sample quantiles corresponding
to the given probabilities. The smallest observation corresponds to a
probability of 0 and the largest to a probability of 1.
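The question mentions the nine quantile types. As a minimal sketch, the loop below computes the first quartile of a small made-up vector under each value of the type argument of quantile. The vector x and the name q25_by_type are illustrative assumptions, not part of the original post.

```r
# Hypothetical example data, small enough that the types can disagree
x <- c(1, 2, 3, 4, 10)

# First quartile under each of the nine quantile definitions (type = 1..9)
q25_by_type <- vapply(
  1:9,
  function(t) quantile(x, probs = 0.25, type = t, names = FALSE),
  numeric(1)
)
names(q25_by_type) <- paste0("type", 1:9)

print(q25_by_type)
```

On small or tied samples the types can return different values, which is why a single set of cut points may match one type and not another.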
So cut splits the range of x, whereas quantile splits the "frequency" of x.
An illustration:
out <- 0:100
out2 <- c(seq(0, 50, 0.001), 51:100)
Both have the same range, from 0 to 100.
levels(cut(out, 4, include.lowest = TRUE))
[1] "[-0.1,25]" "(25,50]" "(50,75]" "(75,100]"
levels(cut(out2, 4, include.lowest = TRUE))
[1] "[-0.1,25]" "(25,50]" "(50,75]" "(75,100]"
But out2 contains many more data points, specifically values between 0 and 50. As a result, the two vectors have different frequencies over those ranges:
quantile(out)
0% 25% 50% 75% 100%
0 25 50 75 100
quantile(out2)
0% 25% 50% 75% 100%
0.0000 12.5125 25.0250 37.5375 100.0000
That is the difference between cut and quantile.
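The example above also shows a case where the two agree, namely when the values are evenly spread. The sequence 0 to 100 is distributed uniformly over the range 0 to 100, so here the equal-width breaks and the quartiles are essentially the same.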
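To make the contrast concrete, here is a minimal sketch that bins the same skewed vector two ways: by equal width with cut, and by equal frequency by passing the sample quartiles to cut as break points. The names q_breaks, equal_width and equal_freq are illustrative assumptions.

```r
# Same skewed example vector as above
out2 <- c(seq(0, 50, 0.001), 51:100)

# Equal-width bins: cut() splits the range 0..100 into 4 intervals
equal_width <- cut(out2, 4, include.lowest = TRUE)

# Equal-frequency bins: use the sample quartiles as break points
q_breaks <- quantile(out2, probs = seq(0, 1, 0.25))
equal_freq <- cut(out2, breaks = q_breaks, include.lowest = TRUE)

width_counts <- table(equal_width)
freq_counts <- table(equal_freq)

print(width_counts)  # very uneven counts per bin
print(freq_counts)   # roughly a quarter of the data per bin
```

With the quartiles as breaks, each bin should hold about a quarter of the observations, while the equal-width bins stay heavily unbalanced.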
To illustrate further, consider:
outdf <- data.frame(out = out, cut = cut(out, 4, include.lowest = TRUE))
out2df <- data.frame(out = out2, cut = cut(out2, 4, include.lowest = TRUE))
table(outdf$cut)
[-0.1,25] (25,50] (50,75] (75,100]
26 25 25 25
table(out2df$cut)
[-0.1,25] (25,50] (50,75] (75,100]
25001 25000 25 25
Here you can clearly see the different frequencies in each bin.
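For the cut2(mydata, g = 4) call in the question, note that the Hmisc documentation describes g as the number of quantile groups, so that call is expected to produce groups of roughly equal size rather than equal width. The sketch below assumes Hmisc is installed and simply prints the group counts next to the output of quantile for comparison. It does not claim that the cut2 break points match any particular quantile type.

```r
# Same skewed example vector as above
out2 <- c(seq(0, 50, 0.001), 51:100)

if (requireNamespace("Hmisc", quietly = TRUE)) {
  # g = 4 asks cut2 for four quantile groups
  groups <- Hmisc::cut2(out2, g = 4)
  print(table(groups))

  # Default (type 7) sample quartiles, for comparison with the group labels
  print(quantile(out2))
} else {
  message("Hmisc is not installed; skipping the cut2 comparison")
}
```

Comparing the interval labels from cut2 with the values from quantile shows directly how close the two sets of break points are for a given data set.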