Strange behavior of returned value when using apply() to subset a matrix

Question

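I feel like this must be something obvious, but I have spent a whole day trying to figure it out and searching for answers, so I'm posting it here in the hope that someone can offer some insight.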

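The short version: I'm using apply() to subset one data frame by the values in another data frame and return the means. The function returns NA for the first 9 iterations when the input has more than 9 rows. When the input has 9 rows or fewer, the data come back fine. It isn't obvious to me what I'm doing to cause this.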

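I have two data frames. The first holds data taken sequentially from a large number of samples. The factor levels of the "ID" column represent the samples. Within each factor level, the "Length" column gives the position along a line on the sample where the measurement was taken. There are two data columns for two different measurements. Below is a simplified, reproducible representation: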

set.seed(10)
ID=factor(sort(rep(paste(letters[1:10]), times=10)))
Length=seq(1:10) + runif(10, 0, 0.9)
Values_1=c(1:20)
Values_2=c(21:40)
test_data=cbind.data.frame(ID, Length, Values_1, Values_2)

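Next I have a matrix of cutoff values that I want to use to subset the "Length" column in "test_data". Each row gives the sample I want to subset, along with the start and end points of the subset.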

ID2=sort(rep(paste(letters[1:10]), times=2)) 
Start=c(1, 5, 1, 5)
Stop=c(5, 10, 7, 10)
Row=c(1:20)
cutoffs=cbind.data.frame(ID2, Start, Stop, Row)
colnames(cutoffs)=c("ID", "Start", "Stop", "Row")
#I'm recycling the cutoffs here. In reality the cutoffs are all pretty different

If I subset the data manually, it works for whichever row I choose:

r=9
subset1=test_data$Values_1[test_data[,1] == cutoffs[r,1] &
                           test_data[,2] >= cutoffs[r,2] &
                           test_data[,2] <= cutoffs[r,3] &
                           !is.na(test_data[,3])]
#[1] 1 2 3 4
mean(subset1)
#There are no NA's in this test data, but the !is.na is there to catch NA's that exist in the real data

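But when I build an apply function to subset all of the data, things get weird and I don't know why. If I run the function on the full table, it returns values only for cutoffs[10:20,], and the first 9 rows are given NA. Yet running it on any subset of cutoffs between rows 1 and 9 returns the correct values.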

apply(cutoffs, 1, function(x){
  subset_1=test_data$Values_1[test_data[,1] == cutoffs[x[4],1] &
                              test_data[,2] >= cutoffs[x[4],2] &
                              test_data[,2] <= cutoffs[x[4],3] &
                              !is.na(test_data[,3])]
  subset_2=test_data$Values_2[test_data[,1] == cutoffs[x[4],1] &
                              test_data[,2] >= cutoffs[x[4],2] &
                              test_data[,2] <= cutoffs[x[4],3] &
                              !is.na(test_data[,4])]
  Mean_1=mean(subset_1)
  Mean_2=mean(subset_2)
  c(Mean_1, Mean_2)
})

    #     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
    #[1,]   NA   NA   NA   NA   NA   NA   NA   NA   NA     7  13.5    17   2.5     7  13.5    17   2.5     7  13.5    17
    #[2,]   NA   NA   NA   NA   NA   NA   NA   NA   NA    27  33.5    37  22.5    27  33.5    37  22.5    27  33.5    37

#Running the same function, but subsetting below 9 rows it returns the correct values
#apply(cutoffs[1:9,], 1, function(x){...
#        1  2    3  4    5  6    7  8    9
#[1,]  2.5  7 13.5 17  2.5  7 13.5 17  2.5
#[2,] 22.5 27 33.5 37 22.5 27 33.5 37 22.5

I know there must be some good reason for this, but I can't see what it is. Any help would be greatly appreciated.

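If there is a more elegant way to do this, please let me know. The real data set is much larger: "cutoffs" has about 3K rows and "test_data" about 250K rows. The function takes a very long time to run, so I assume there is a better way to do this.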

Answer 1

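First, don't use apply on a data frame. It converts the data frame to a matrix, which means all columns are coerced to a single type. In particular, if any column is character or factor, the resulting matrix will be character as well.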
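To see the coercion in isolation, here is a minimal sketch using a small hypothetical data frame `df` (it is not part of your data). Note that `as.matrix()` also formats numeric columns to a common width, so single-digit numbers pick up a leading space.

```r
# Hypothetical two-row data frame, for illustration only
df <- data.frame(id = c("a", "b"), n = c(1, 10))

# apply() coerces a data frame with as.matrix() before looping over rows
m <- as.matrix(df)

typeof(m)   # "character": the numeric column did not survive as numeric
m[, "n"]    # " 1" "10": formatted to a common width, hence the padding
```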

That is not the only issue, though. Look at the first code block you provided:

subset1 <- test_data$Values_1[test_data[,1] == cutoffs[r,1] &
                              test_data[,2] >= cutoffs[r,2] &
                              test_data[,2] <= cutoffs[r,3] &
                              !is.na(test_data[,3])]

And the second code block:

subset_1 <- test_data$Values_1[test_data[,1] == cutoffs[x[4],1] &
                               test_data[,2] >= cutoffs[x[4],2] &
                               test_data[,2] <= cutoffs[x[4],3] &
                               !is.na(test_data[,3])]
subset_2 <- test_data$Values_2[test_data[,1] == cutoffs[x[4],1] &
                               test_data[,2] >= cutoffs[x[4],2] &
                               test_data[,2] <= cutoffs[x[4],3] &
                               !is.na(test_data[,4])]

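These are not the same: the second indexes cutoffs with x[4] rather than a row number (what is the point of the 4th element of x?), and inside apply() x[4] is a character string, not an integer. Assuming the first code block is what you want, applying it to every row looks like this: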

sapply(seq_len(nrow(cutoffs)), function(r) {
    vals1 <- test_data$Values_1[test_data[,1] == cutoffs[r,1] &
                                test_data[,2] >= cutoffs[r,2] &
                                test_data[,2] <= cutoffs[r,3] &
                                !is.na(test_data[,3])]
    vals2 <- test_data$Values_2[test_data[,1] == cutoffs[r,1] &
                                test_data[,2] >= cutoffs[r,2] &
                                test_data[,2] <= cutoffs[r,3] &
                                !is.na(test_data[,4])]
    c(mean(vals1), mean(vals2))
})

#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
#[1,]  2.5    7 13.5   17  2.5    7 13.5   17  2.5     7  13.5    17   2.5     7  13.5    17   2.5     7  13.5    17
#[2,] 22.5   27 33.5   37 22.5   27 33.5   37 22.5    27  33.5    37  22.5    27  33.5    37  22.5    27  33.5    37
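As for why the original apply() call gave NA for exactly the first nine rows, the coercion mentioned at the top looks like the cause: x[4] arrives as a padded string such as " 1", and a character row index is matched against the row names, where " 1" is not found. The sketch below rebuilds your cutoffs with data.frame() (an assumption made for brevity; you used cbind.data.frame) and checks which "Row" strings match a row name.

```r
# Rebuild the cutoffs table from the question (Start/Stop are recycled)
cutoffs <- data.frame(
  ID    = sort(rep(letters[1:10], times = 2)),
  Start = c(1, 5, 1, 5),
  Stop  = c(5, 10, 7, 10),
  Row   = 1:20
)

# The "Row" column as apply() sees it: character, padded to width 2
row_seen <- as.matrix(cutoffs)[, "Row"]
row_seen[c(1, 9, 10)]          # " 1" " 9" "10"

# A character row index is looked up in the row names ("1", "2", ..., "20")
no_match <- which(!(row_seen %in% rownames(cutoffs)))
no_match                       # 1 2 3 4 5 6 7 8 9

cutoffs[" 1", "Start"]         # NA: no row is named " 1"
cutoffs["10", "Start"]         # 5: "10" matches a row name
```

With cutoffs[1:9,] every "Row" value is a single digit, so no padding is added and the lookups succeed, which would explain why the smaller input worked. Looping over seq_len(nrow(cutoffs)) as above sidesteps the issue because r stays an integer.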

Tags: r, subset, apply