R: Apply-like function for dataframe rows when columns vary

Question

I have a sparse data frame, example. It has five data columns, but each row has only two entries, randomly distributed among the columns:

id  a   b   c   d   e
1   NA  10  NA  NA  1
2   6   NA  10  NA  NA
3   3   NA  NA  2   NA
4   NA  NA  9   4   NA
5   NA  NA  1   NA  5

I want to return a data frame with only two data columns, holding the two values from each row:

id  val1    val2
1   10      1
2   6       10
3   3       2
4   9       4
5   1       5

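This can be done with a for loop, but my real data is large, so I would like to use something apply-like. Everything I have seen assumes you know which columns you will be using. I tried writing my own one-line function and passing it to apply, but I keep getting the error "incorrect number of dimensions".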

Answer 1

Try

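# For each row, keep only the non-NA values; apply() returns them as columns,
# so transpose before binding to the id column
d1 <- setNames(
  data.frame(example$id,
             t(apply(example[-1], 1, function(x) x[!is.na(x)]))),
  c('id', 'val1', 'val2'))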
d1
#  id val1 val2
#1  1   10    1
#2  2    6   10
#3  3    3    2
#4  4    9    4
#5  5    1    5
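
Note that the apply call above returns a matrix only because every row has exactly two non-NA values. If some rows had a different count, apply would return a list and the t() step would fail. A minimal sketch of one way to guard against that, assuming short rows should be padded with NA (the helper name first_non_na and its argument k are hypothetical, not part of the original answer):

```r
# Hypothetical helper: keep the first k non-NA values of a vector,
# padding with NA when fewer than k are present.
first_non_na <- function(x, k = 2) {
  vals <- x[!is.na(x)]
  length(vals) <- k   # truncates to k, or pads with NA up to k
  vals
}

# Same data as in the question
example <- data.frame(
  id = 1:5,
  a = c(NA, 6L, 3L, NA, NA),
  b = c(10L, NA, NA, NA, NA),
  c = c(NA, 10L, NA, 9L, 1L),
  d = c(NA, NA, 2L, 4L, NA),
  e = c(1L, NA, NA, NA, 5L)
)

# Every row now yields exactly k values, so apply() always returns a matrix
res <- unname(t(apply(example[-1], 1, first_non_na, k = 2)))
d2 <- data.frame(id = example$id, val1 = res[, 1], val2 = res[, 2])
d2
```

With the example data this should give the same values as d1; the padding only matters for rows that have fewer than two entries.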

Or you can convert to 'long' format and then back to 'wide':

library(data.table)
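# melt() drops the NA cells; each remaining value is then labelled val1, val2
# within its id before casting back to wide format
dcast(melt(setDT(example), id.vars='id', na.rm=TRUE)[,
           ind:=paste0('val', seq_len(.N)), by=id], id~ind, value.var='value')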
#    id val1 val2
#1:  1   10    1
#2:  2    6   10
#3:  3    3    2
#4:  4    9    4
#5:  5    1    5

Data

example <- structure(list(id = 1:5,
    a = c(NA, 6L, 3L, NA, NA),
    b = c(10L, NA, NA, NA, NA),
    c = c(NA, 10L, NA, 9L, 1L),
    d = c(NA, NA, 2L, 4L, NA),
    e = c(1L, NA, NA, NA, 5L)),
  .Names = c("id", "a", "b", "c", "d", "e"),
  class = "data.frame", row.names = c(NA, -5L))

Answer 2

This should be a pretty fast approach:

temp <- t(example[-1])  # Matrix of all columns other than the first, transposed
cbind(example[1],       # Bind the first column with a two-column matrix
                        # created by using is.na and which
      matrix(temp[which(!is.na(temp), arr.ind = TRUE)], 
             ncol = 2, byrow = TRUE))
#   id  1  2
# 1  1 10  1
# 2  2  6 10
# 3  3  3  2
# 4  4  9  4
# 5  5  1  5
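
The transpose is what makes this work: R stores matrices column by column, so extracting the non-NA cells of the transposed matrix walks through the original rows one at a time. A small sketch of the difference, using the example data (the variable names m, by_column and by_row are only for illustration):

```r
# Same data as in the question
example <- data.frame(
  id = 1:5,
  a = c(NA, 6L, 3L, NA, NA),
  b = c(10L, NA, NA, NA, NA),
  c = c(NA, 10L, NA, 9L, 1L),
  d = c(NA, NA, 2L, 4L, NA),
  e = c(1L, NA, NA, NA, 5L)
)

m <- as.matrix(example[-1])        # rows = ids, columns = a..e

# Without transposing, values come out grouped by column (a, then b, ...)
by_column <- m[!is.na(m)]
by_column
# 6 3 10 10 9 1 2 4 1 5

# After transposing, values come out grouped by original row (id 1, id 2, ...)
by_row <- t(m)[!is.na(t(m))]
by_row
# 10 1 6 10 3 2 9 4 1 5

# Filling a two-column matrix row by row then pairs up each id's values
matrix(by_row, ncol = 2, byrow = TRUE)
```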

In a quick test on a 5-million-row dataset, it ran faster than both the "data.table" and apply approaches.
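
The data used for that test is not shown. A rough sketch of how a comparison could be reproduced, assuming synthetic data with exactly two non-NA integers per row and a much smaller size of 1e5 rows (the names big, by_apply and by_index are made up for this sketch, and timings will vary by machine):

```r
set.seed(42)
n <- 1e5   # assumed size for a quick check, far smaller than 5 million

# Pick two distinct columns per row: col2 is col1 shifted by 1..4 (mod 5)
col1 <- sample(5, n, replace = TRUE)
col2 <- (col1 + sample(4, n, replace = TRUE) - 1) %% 5 + 1

m <- matrix(NA_integer_, n, 5, dimnames = list(NULL, letters[1:5]))
m[cbind(seq_len(n), col1)] <- sample(10L, n, replace = TRUE)
m[cbind(seq_len(n), col2)] <- sample(10L, n, replace = TRUE)
big <- data.frame(id = seq_len(n), m)

# Row-wise apply approach from Answer 1
by_apply <- function(df) t(apply(df[-1], 1, function(x) x[!is.na(x)]))

# Transpose-and-index approach from Answer 2
by_index <- function(df) {
  temp <- t(df[-1])
  matrix(temp[which(!is.na(temp), arr.ind = TRUE)], ncol = 2, byrow = TRUE)
}

system.time(r1 <- by_apply(big))
system.time(r2 <- by_index(big))

# Both approaches should return the same values
all.equal(unname(r1), r2)
```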
