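How do I create a variable that tells me which of a number of other variables is the first one to not have a missing value for one observation?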

Question

If I have the following data structure in my data frame df:

a  b  c  d

1  2  3  4
NA NA 1  2
NA 1  2  NA
NA NA NA 1

How do I create a variable that tells me, for each row, which variable is the first one without a missing value, like this:

a  b  c  d  var

1  2  3  4  a
NA NA 1  2  c
NA 1  2  NA b
NA NA NA 1  d

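I need the code to work with variable names rather than column numbers, because of the size of the dataset and because the order of the variables changes.

I have tried: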

df <- df %>% mutate(var = coalesce(deparse(substitute(a)), deparse(substitute(b)), deparse(substitute(c)), deparse(substitute(d))))

and

df <- df %>% mutate(var = deparse(substitute(do.call(coalesce, across(c(a, b, c, d))))))

while trying to implement this approach. I took the code for extracting a variable name as a string from: How to convert variable (object) name into String

Answer 1

You could do

library(dplyr)

df %>% mutate(var = apply(., 1, \(x) names(which(!is.na(x)))[1]))
#>    a  b  c  d var
#> 1  1  2  3  4   a
#> 2 NA NA  1  2   c
#> 3 NA  1  2 NA   b
#> 4 NA NA NA  1   d
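Because `.` passes every column of df to apply(), the result depends on whatever columns happen to be present. A minimal base R sketch of the same idea restricted to a named set of columns follows; the vector `cols` is a name introduced here for illustration and is not part of the original answer.

```r
# Example data from the question
df <- data.frame(
  a = c(1, NA, NA, NA),
  b = c(2, NA, 1, NA),
  c = c(3, 1, 2, NA),
  d = c(4, 2, NA, 1)
)

# Columns to inspect, by name, in priority order (assumed for illustration)
cols <- c("a", "b", "c", "d")

# For each row, keep the name of the first non-missing column;
# a row with no non-missing value yields NA
df$var <- apply(df[cols], 1, function(x) names(x)[!is.na(x)][1])

df$var
```

Selecting by name means the result does not change if unrelated columns are added to df later.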

Answer 2

We can use max.col, i.e.

names(df)[max.col(!is.na(df), ties.method = 'first')]
#[1] "a" "c" "b" "d"
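To see why this works, it may help to split the one-liner into steps. The sketch below rebuilds the question's data and uses the intermediate names `present` and `idx`, which are introduced here only for illustration.

```r
# Example data from the question
df <- data.frame(
  a = c(1, NA, NA, NA),
  b = c(2, NA, 1, NA),
  c = c(3, 1, 2, NA),
  d = c(4, 2, NA, 1)
)

# Logical matrix: TRUE where a value is present
present <- !is.na(df)

# For each row, the position of the maximum. TRUE counts as 1 and FALSE as 0,
# so with ties.method = "first" this is the first column holding a value
idx <- max.col(present, ties.method = "first")

# Turn column positions into column names
first_var <- names(df)[idx]
first_var
```

The tie-breaking argument matters: the default for `ties.method` is "random", which would pick any of the non-missing columns rather than the first.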

Via dplyr,

library(dplyr)

df %>% 
 mutate(var = names(.)[max.col(!is.na(.), ties.method = 'first')])

   a  b  c  d var
1  1  2  3  4   a
2 NA NA  1  2   c
3 NA  1  2 NA   b
4 NA NA NA  1   d

You can specify the variables

df %>% 
 mutate(var = names(.[c('a', 'b')])[max.col(!is.na(.[c('a', 'b')]), ties.method = 'first')])
   a  b  c  d var
1  1  2  3  4   a
2 NA NA  1  2   a
3 NA  1  2 NA   b
4 NA NA NA  1   a
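Note that rows 2 and 4 are reported as a even though both a and b are missing there: when every entry in a row is FALSE, all columns tie and ties.method = 'first' returns column 1. If NA is preferred for such rows, one possible guard is sketched below; `cols`, `present` and `idx` are names assumed here for illustration.

```r
# Example data from the question
df <- data.frame(
  a = c(1, NA, NA, NA),
  b = c(2, NA, 1, NA),
  c = c(3, 1, 2, NA),
  d = c(4, 2, NA, 1)
)

cols <- c("a", "b")                      # variables to check, by name
present <- !is.na(df[cols])              # TRUE where a value is present
idx <- max.col(present, ties.method = "first")

# Rows with no non-missing value among `cols` get NA instead of column 1
idx[rowSums(present) == 0] <- NA

df$var <- cols[idx]
df$var
```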

You can also change the order in which the variables are checked via select(), i.e.

df %>% 
 select(c, d, b, a) %>%
 mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')])

   c  d  b  a new
1  3  4  2  1   c
2  1  2 NA NA   c
3  2 NA  1 NA   c
4 NA  1 NA NA   d

You can also call select() again to restore the original column order while keeping the result computed from the reordered columns, i.e.

df %>% 
 select(c, d, b, a) %>%
 mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')]) %>% 
 select(names(df), new)

   a  b  c  d new
1  1  2  3  4   c
2 NA NA  1  2   c
3 NA  1  2 NA   c
4 NA NA NA  1   d

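If you check only a subset of the variables, select() drops the others. To keep all the variables in the end, you can join back onto the original data frame, i.e.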

df %>% 
 select(c, d, b) %>%
 mutate(new = names(.)[max.col(!is.na(.), ties.method = 'first')]) %>% 
 left_join(df) %>% 
 select(names(df), new)

Joining, by = c("c", "d", "b")
   a  b  c  d new
1  1  2  3  4   c
2 NA NA  1  2   c
3 NA  1  2 NA   c
4 NA NA NA  1   d
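An alternative that avoids reordering and re-joining is to compute the result from a named subset and assign it back to the untouched data frame. The helper `first_non_na()` below is a hypothetical function written for this illustration, not something provided by dplyr or base R.

```r
# Example data from the question
df <- data.frame(
  a = c(1, NA, NA, NA),
  b = c(2, NA, 1, NA),
  c = c(3, 1, 2, NA),
  d = c(4, 2, NA, 1)
)

# Hypothetical helper: name of the first non-missing column among `cols`,
# checked in the order given; NA when all of them are missing
first_non_na <- function(data, cols) {
  present <- !is.na(data[cols])
  idx <- max.col(present, ties.method = "first")
  idx[rowSums(present) == 0] <- NA
  cols[idx]
}

# Check c, then d, then b, without touching the column order of df
df$new <- first_non_na(df, c("c", "d", "b"))
df
```

Since no join is involved, rows that share identical values in the checked columns cannot be duplicated by the matching step.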
