Getting head() and tail() where NA is present for many variables

Question

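I have several dozen variables, many of which have missing values, including in the first and last observations. I want a new dataset containing each person's first and last observation for each variable, ignoring missing values.

The code below does this, but I am hoping for 1) a function that works like head() without having to remove the NAs manually, and 2) a way to write a function that dplyr's summarize_each() can apply automatically to every variable in the dataset (except id, of course).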

library(dplyr)  # group_by(), summarise(), %>%
library(tidyr)  # gather(), separate(), spread()

set.seed(23331)
df <- data.frame(id=rep(c(1,2,3,4), each = 5),
                 a = c(NA, rnorm(4), rnorm(3), rep(NA, 2), rnorm(4), rep(NA, 5), rnorm(1)),
                 b = c(rep(NA, 2), rnorm(14), rep(NA, 3), rnorm(1)))
df %>% group_by(id) %>% summarise(a.head=head(a[!is.na(a)], n=1), 
                                  a.tail=tail(a[!is.na(a)], n=1),
                                  b.head=head(b[!is.na(b)], n=1), 
                                  b.tail=tail(b[!is.na(b)], n=1)) %>% 
  gather("type", "value", -id) %>% 
  separate(type, into = c("variable", "time"), sep = "\\.") %>% 
  spread(variable, value)

I would prefer a dplyr solution, but I will take a base or data.table solution if one of those is the best approach.

Desired output:

Source: local data frame [8 x 4]

     id  time          a          b
  (dbl) (chr)      (dbl)      (dbl)
1     1  head -0.5877282  0.4975612
2     1  tail -0.7904277 -0.3860010
3     2  head  0.5872134 -0.3923887
4     2  tail -0.3222003  0.3114662
5     3  head -0.2553290  0.7521095
6     3  tail  0.3095699 -0.9113326
7     4  head -0.3809334  1.4752274
8     4  tail -0.3809334  3.2767918

Answer 1

We convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by 'id', loop through the Subset of Data.table (lapply(.SD, ...)) to get the head and tail of the non-NA values in each column.

library(data.table)
f1 <- function(x, n) {x1 <- x[!is.na(x)]; c(head(x1,n), tail(x1,n))}
setDT(df)[,lapply(.SD, f1, n=1) ,id][, time:= c('head', 'tail')][]
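
To make the helper's behaviour concrete, here is a small sketch that applies the same f1 to plain vectors. The input vectors are made up for illustration and are not part of the question's data. It also shows why id 4 has identical head and tail values for a: a group with a single non-missing value returns that value twice.

```r
# Same helper as above: drop NAs, then take the first n and last n values.
f1 <- function(x, n) {
  x1 <- x[!is.na(x)]
  c(head(x1, n), tail(x1, n))
}

# Illustrative vector (an assumption, not taken from the question).
x <- c(NA, 3, NA, 7, 9, NA)
first_last <- f1(x, n = 1)               # first and last non-NA: 3 and 9

# Only one non-missing value: it is both the head and the tail.
one_value <- f1(c(NA, 5, NA), n = 1)     # 5 and 5

# No non-missing values at all: the result has length zero.
all_missing <- f1(c(NA_real_, NA_real_), n = 1)
length(all_missing)                      # 0
```

Note that a group with no non-missing values yields a zero-length result, so the two-row-per-group layout assumed by time := c('head', 'tail') would presumably no longer hold. The example data has at least one non-missing value per id and column, so this case does not arise here.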

Or use melt/dcast:

 DT <- setDT(df)[,melt(lapply(.SD, function(x) list(head=head(x[!is.na(x)],1),
              tail=tail(x[!is.na(x)],1)))) ,id]
 dcast(DT, id+L2~L1, value.var='value')

Answer 2

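dplyr is not friendly towards transformations that return a number of rows other than 1 or n().

To stay in that world, you can use do, which is inefficient as far as I know: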

library(dplyr)     # first(), last(), group_by(), do()
library(magrittr)  # `. %>% ...` builds a reusable function
ht_nona = . %>% na.omit %>% { c(first(.), dplyr::last(.)) }

df %>% group_by(id) %>% do( as.data.frame(lapply(., ht_nona)) )

Another (arguably worse) option is to summarise twice and bind the rows:

bind_rows(
  df %>% group_by(id) %>% summarise_each(funs(. %>% na.omit %>% first)),
  df %>% group_by(id) %>% summarise_each(funs(. %>% na.omit %>% (dplyr::last)))
)
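
summarise_each() and funs() are deprecated in newer dplyr releases. As a rough sketch of the same summarise-twice idea, assuming dplyr 1.0.0 or later is installed, across() can be used instead. The helper names first_non_na and last_non_na are hypothetical and were introduced only for this example.

```r
library(dplyr)

# Example data from the question.
set.seed(23331)
df <- data.frame(id = rep(c(1, 2, 3, 4), each = 5),
                 a = c(NA, rnorm(4), rnorm(3), rep(NA, 2), rnorm(4), rep(NA, 5), rnorm(1)),
                 b = c(rep(NA, 2), rnorm(14), rep(NA, 3), rnorm(1)))

# Hypothetical helpers: first / last non-missing value, NA if there is none.
first_non_na <- function(x) x[!is.na(x)][1]
last_non_na  <- function(x) rev(x[!is.na(x)])[1]

# Summarise every non-grouping column twice, then stack the two results.
# The names given to bind_rows() become the values of the "time" column.
res <- bind_rows(
  head = df %>% group_by(id) %>% summarise(across(everything(), first_non_na)),
  tail = df %>% group_by(id) %>% summarise(across(everything(), last_non_na)),
  .id = "time"
) %>%
  arrange(id, time)

res
```

Because both helpers index with [1], a column that is entirely missing within a group gives NA instead of a zero-length result, which keeps one row per id in each half.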

Answer 3

A variation on @akrun's answer, also using data.table:

library(data.table)

setDT(df)[, c(
  list(time=c("head","tail")), 
  lapply(.SD, function(v) setDT(list(v))[!is.na(V1)][c(1,.N), V1] )
), by=id]

   id time          a          b
1:  1 head -0.5877282  0.4975612
2:  1 tail -0.7904277 -0.3860010
3:  2 head  0.5872134 -0.3923887
4:  2 tail -0.3222003  0.3114662
5:  3 head -0.2553290  0.7521095
6:  3 tail  0.3095699 -0.9113326
7:  4 head -0.3809334  1.4752274
8:  4 tail -0.3809334  3.2767918

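Here setDT(list(v)) wraps each column vector in a one-column data.table (with the default column name V1), so it can be filtered and indexed using data.table syntax.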

Answer 4

The downside is that this one requires three packages:

set.seed(23331)
df <- data.frame(id=rep(c(1,2,3,4), each = 5),
                 a = c(NA, rnorm(4), rnorm(3), rep(NA, 2), rnorm(4), rep(NA, 5), rnorm(1)),
                 b = c(rep(NA, 2), rnorm(14), rep(NA, 3), rnorm(1)))

library('base')
library('utils')
library('stats')

data.frame(id = rep(1:4, each = 2), time = c('head', 'tail'), 
           sapply(df[, -1], function(x) unlist(tapply(x, df$id, FUN = function(y)
             c(head(na.omit(y), 1), tail(na.omit(y), 1))))))

#    id time          a          b
# 11  1 head -0.5877282  0.4975612
# 12  1 tail -0.7904277 -0.3860010
# 21  2 head  0.5872134 -0.3923887
# 22  2 tail -0.3222003  0.3114662
# 31  3 head -0.2553290  0.7521095
# 32  3 tail  0.3095699 -0.9113326
# 41  4 head -0.3809334  1.4752274
# 42  4 tail -0.3809334  3.2767918
