Cutting a variable into pieces in R

Question

I am trying to cut() my data D into 3 pieces: [0-4], [5-12], [13-40] (see the picture below). How exactly should I define my breaks in cut to achieve this?

Here is my data and R code:

D <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/t.csv", header = TRUE)


 table(cut(D$time, breaks = c(0, 5, 9, 12))) ## what should breaks be?

 # (0,5]  (5,9] (9,12]  # cuts not how I want the 3 pieces .
 #  228     37     10

Answer 1

The notation (a,b] means ">a and <=b".

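So, to get the result you want, define the breaks so that each interval covers the grouping you want, including both the lower and upper bounds: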

table(cut(D$time, breaks=c(-1, 4, 12, 40)))

## (-1,4]  (4,12] (12,40] 
##   319      47      20

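You may also find it helpful to look at two arguments: right=FALSE, which changes the interval endpoints from (a,b] to [a,b), and include.lowest, which includes the lowest breaks value (in the OP's example, this gives [0,5] with a closed bracket on the lower bound). You can also use infinity. Here is an example that uses a couple of these options: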

table(cut(D$time, breaks = c(-Inf, 4, 12, Inf), include.lowest=TRUE))

## [-Inf,4]    (4,12] (12, Inf] 
##     319        47        20
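
To see the bracket rules in action, here is a minimal sketch on a small made-up vector (x is hypothetical toy data, not the OP's file). It shows that the default (a,b] intervals drop a value equal to the lowest break, and that right=FALSE combined with include.lowest=TRUE closes the left ends instead.

```r
# Hypothetical toy data chosen to sit on the bucket edges
x <- c(0, 1, 4, 5, 12, 13, 40)

# Default right = TRUE gives (a, b] intervals, so 0 is outside (0, 5]
default_cut <- cut(x, breaks = c(0, 5, 13, 40))
is.na(default_cut[1])   # TRUE: the value 0 is not assigned to any interval

# right = FALSE gives [a, b); include.lowest = TRUE then closes the last interval at 40
left_closed <- cut(x, breaks = c(0, 5, 13, 40), right = FALSE, include.lowest = TRUE)
table(left_closed)
```

With integer data, [0,5) and [5,13) contain exactly the values 0 to 4 and 5 to 12.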

Answer 2

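The code below produces the right buckets, but the interval notation needs adjusting. It assumes all times are integers. You may need to adjust the labels by hand: wherever a label uses right-open interval notation, replace it with the closed-interval equivalent. Use your best string 'magic'.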

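Personally, I like to make sure all possibilities are covered. Maybe future data from this process could exceed 40? I like to set an upper bound of +Inf in all my cuts. This prevents NAs from creeping into the data.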
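
As a small illustration of that point (a sketch on made-up values, not the OP's data), a value above the largest finite break becomes NA, while an Inf upper break keeps it in the last bucket:

```r
# Hypothetical data: 57 exceeds the original upper break of 40
x <- c(0, 3, 12, 40, 57)

bounded   <- cut(x, breaks = c(-1, 4, 12, 40))
unbounded <- cut(x, breaks = c(-1, 4, 12, Inf))

sum(is.na(bounded))     # 1: the value 57 falls outside every interval
sum(is.na(unbounded))   # 0: the last interval now has no upper limit
```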

What cut really needs is an "integers only" option.

F=cut(D$time,c(0,5,13,40),include.lowest = TRUE,right=FALSE)
# the below levels hard coded but you could write a loop to turn all labels
# of the form [m,n) into [m,n-1]
levels(F)[1:2]=c('[0,4]','[5,12]')

There is usually more analysis to do before the final result, so I don't spend much effort on the labels until the work is nearly done.

Here are my results:

 > table(F) 
 F
 [0,4]  [5,12]  [13,40] 
 319      47      20
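
The relabelling step mentioned in the code comment can be sketched without hard-coding the levels. close_labels below is a hypothetical helper written for this illustration (it is not part of base R), and it is only valid when the data are integers:

```r
# Hypothetical toy data on the bucket edges
x <- c(0, 4, 5, 12, 13, 40)
f <- cut(x, breaks = c(0, 5, 13, 40), include.lowest = TRUE, right = FALSE)

# Hypothetical helper: rewrite labels of the form "[m,n)" as "[m,n-1]".
# Labels that do not match the pattern (e.g. "[13,40]") are left untouched.
close_labels <- function(lab) {
  pattern <- "^\\[(-?[0-9]+),(-?[0-9]+)\\)$"
  open <- grepl(pattern, lab)
  lo <- sub(pattern, "\\1", lab[open])
  hi <- as.integer(sub(pattern, "\\2", lab[open])) - 1L
  lab[open] <- paste0("[", lo, ",", hi, "]")
  lab
}

levels(f) <- close_labels(levels(f))
levels(f)
```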

Answer 3

R can compare integers with floating-point numbers, as in

> 6L >= 8.5
[1] FALSE

So you can use floating-point numbers in breaks, for example

table(cut(D$time, breaks = c(-.5, 4.5, 12.5, 40.5)))

For integers, this satisfies your bucket definition of [0-4], [5-12], [13-40] without you having to think too hard about square versus round brackets.
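
A quick way to convince yourself (a sketch over every integer from 0 to 40, not the OP's data) is to check the smallest and largest value that lands in each half-integer interval:

```r
# All integers in the range of interest
x <- 0:40
half <- cut(x, breaks = c(-0.5, 4.5, 12.5, 40.5))

# Smallest and largest integer in each bucket
bucket_min <- as.vector(tapply(x, half, min))   # 0, 5, 13
bucket_max <- as.vector(tapply(x, half, max))   # 4, 12, 40
rbind(bucket_min, bucket_max)
```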

A fancy alternative is to cluster around the means of your buckets, as in

D <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/t.csv", header = TRUE)
# initial cluster centres are the midpoints of the three buckets
D$cluster <- kmeans(D$time, centers = c(4/2, (5+12)/2, (13+40)/2))$cluster
plot(D$time, rnorm(nrow(D)), col=D$cluster)

Answer 4

You should add two more arguments to your code, right and include.lowest!

table(cut(D$time, breaks = c(0, 5, 13, 40), right=FALSE, include.lowest = TRUE))

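With right=FALSE, the intervals are closed on the left and open on the right, which gives you the result you want. When right=FALSE, include.lowest=TRUE causes your highest break value (here 40) to be included in the last interval. Result: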

[0,5)  [5,13) [13,40] 
 319      47      20

Conversely, you can write:

table(cut(D$time, breaks = c(0, 4, 12, 40), right=TRUE, include.lowest = TRUE))

Result:

 [0,4]  (4,12] (12,40] 
  319      47      20

For integer data, both express what you are looking for:

[0,4]  [5,12] [13,40] 
 319      47      20
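
If you want those exact labels on the factor, one option is cut's labels argument. The sketch below uses made-up integer values (not the OP's data) and checks that the two calls above assign every value to the same bucket:

```r
# Hypothetical toy data, three values per bucket
x <- c(0, 2, 4, 5, 9, 12, 13, 27, 40)
labs <- c("[0,4]", "[5,12]", "[13,40]")

by_left  <- cut(x, breaks = c(0, 5, 13, 40), right = FALSE,
                include.lowest = TRUE, labels = labs)
by_right <- cut(x, breaks = c(0, 4, 12, 40), right = TRUE,
                include.lowest = TRUE, labels = labs)

all(by_left == by_right)   # TRUE for integer data
table(by_left)
```

The custom labels are only accurate when the data are integers; a value such as 4.5 would fall into different buckets under the two sets of breaks.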
