Get 2D table (6x6) for dataframe containing two continuous variables, by binning

Question

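I am trying to divide the observations in a data frame into 36 groups based on two continuous variables. More specifically, I want to cut each of the two variables into six bins and then assign every observation to one of the 36 possible bin combinations.

My attempt is below, and it works. But is there a faster way that avoids the double for loop?

Also, this isn't essential, but how can I display the number of observations in each group as a 6 x 6 grid? I know table() would produce a list of the 36 possible groups and their totals, but not in grid format.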

set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
data <- data.frame(x1,x2)

# Recover numeric bin bounds from the "(lower,upper]" labels that cut() produces.
# Backslashes must be doubled inside R string literals.
labs1 <- levels(cut(x1, 6))
ints1 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs1)),
               upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs1)))
labs2 <- levels(cut(x2, 6))
ints2 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs2)),
               upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs2)))

# All 36 combinations of an x1 bin and an x2 bin (the x1 bin varies fastest)
tmp <- expand.grid(labs1, labs2)
groups <- cbind(lower1 = as.numeric(sub("\\((.+),.*", "\\1", tmp[,1])),
                upper1 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,1])),
                lower2 = as.numeric(sub("\\((.+),.*", "\\1", tmp[,2])),
                upper2 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,2])))

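# Create the column up front so that rows matching no group stay NA
data$group <- NA_integer_
for (i in 1:1000){
  for (j in 1:36){
    # && is the scalar logical operator, which is what if() expects
    if (x1[i] >= groups[j,1] && x1[i] <= groups[j,2] &&
        x2[i] >= groups[j,3] && x2[i] <= groups[j,4]){
      data$group[i] <- j
    }
  }
}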

Answer 1

You can combine apply(), to iterate over the rows of your data.frame, with which(), to search your groups array:

data$group <- apply(data, 1, FUN=function(dataRow) 
  which(
    dataRow[1] >= groups[,1] & 
    dataRow[1] <= groups[,2] & 
    dataRow[2] >= groups[,3] & 
    dataRow[2] <= groups[,4]))
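
As a further sketch, the lookup can be vectorized completely. cut(..., labels = FALSE) returns integer bin codes rather than a factor, so a 1..36 group index can be computed with plain arithmetic. The names b1 and b2 are illustrative, and the numbering assumes the expand.grid() ordering used in the question, where the x1 bin varies fastest.

```r
set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
data <- data.frame(x1, x2)

# Integer bin codes (1..6) instead of factor labels
b1 <- cut(data$x1, 6, labels = FALSE)
b2 <- cut(data$x2, 6, labels = FALSE)

# Same ordering as expand.grid(labs1, labs2): the x1 bin varies fastest
data$group <- (b2 - 1L) * 6L + b1
```

Because the bin membership comes straight from cut(), each observation gets exactly one group, with no need to parse the bin labels back into numbers.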

Answer 2

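You're overthinking this. Getting the 6x6 table is a one-liner with table(). (Use the factor that cut(..., 6) creates directly; don't throw the factor away and then manually re-derive its levels and re-bin the variable.):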

with(data, table(cut(x1, 6), cut(x2, 6)))

                 (-3.05,-1.97] (-1.97,-0.902] (-0.902,0.171] (0.171,1.24] (1.24,2.32] (2.32,3.4]
  (-2.82,-1.8]               2             10             11            7           3          0
  (-1.8,-0.793]              1             26             67           49          19          3
  (-0.793,0.216]            12             57            140          146          31          3
  (0.216,1.22]              11             49            109           95          36          6
  (1.22,2.23]                0             10             31           34          15          0
  (2.23,3.25]                0              3              5            6           2          1

# and to get the wide lines, you may need...
options('width'=199)

# or if you want more compact labels to keep it all narrow, use `cut(..., dig.lab)`
with(data, table(cut(x1, 6, dig.lab=2), cut(x2, 6, dig.lab=2)))

               (-3.1,-2] (-2,-0.9] (-0.9,0.17] (0.17,1.2] (1.2,2.3] (2.3,3.4]
  (-2.8,-1.8]          2        10          11          7         3         0
  (-1.8,-0.79]         1        26          67         49        19         3
  (-0.79,0.22]        12        57         140        146        31         3
  (0.22,1.2]          11        49         109         95        36         6
  (1.2,2.2]            0        10          31         34        15         0
  (2.2,3.2]            0         3           5          6         2         1

Admittedly, neither the table() nor the cut() documentation says outright that they can be used for a 2D case like this. => Doc/Enhance-bug
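
If the bins are needed per row as well as in the grid, one possible approach is to store the cut() factors as columns and tabulate those. This is only a sketch: the names bin1, bin2, counts and counts_long are illustrative and do not come from the answer.

```r
set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
data <- data.frame(x1, x2)

# Keep the bin factors as columns so each row carries its own bin labels
data$bin1 <- cut(data$x1, 6, dig.lab = 2)
data$bin2 <- cut(data$x2, 6, dig.lab = 2)

# 6 x 6 grid of counts (rows: x1 bins, columns: x2 bins)
counts <- table(data$bin1, data$bin2)

# Long format: one row per cell, with columns Var1, Var2 and Freq
counts_long <- as.data.frame(counts)
```

The counts object prints as the grid, while counts_long gives the same totals as a 36-row list, which is the non-grid form the question mentions.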

