Remove NA in front of one specific string but leave in front of another specific string, by group

I have this data frame:

df <- data.frame(
  id = rep(1:4, each = 4), 
  status = c(
    NA, "a", "c", "a", 
    NA, "b", "c", "c",
    NA, NA, "a", "c",
    NA, NA, "b", "b"), 
  stringsAsFactors = FALSE)

For each group (id), I want to remove the rows with one or more leading NAs in front of an "a" (in the column "status"), but not those in front of a "b".

The final data frame should look like this:

structure(list(
  id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L), 
  status = c("a", "c", "a", NA, "b", "c", "c", "a", "c", NA, NA, "b", "b")), 
  .Names = c("id", "status"), row.names = c(NA, -13L), class = "data.frame")

How would I do that?

Edit: Alternatively, how would I keep other variables in the data frame, such as the variable otherVar in the following example:

df2 <- data.frame(
   id = rep(1:4, each = 4), 
   status = c(
    NA, "a", "c", "a", 
    NA, "b", "c", "c",
    NA, NA, "a", "c",
    NA, NA, "b", "b"),
  otherVar = letters[1:16],
  stringsAsFactors = FALSE)

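We can group by 'id', summarise 'status' by pasting its elements together with toString, then use gsub to remove the NAs that come before an 'a', and convert the result back to 'long' format with separate_rows.
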
library(dplyr)
library(tidyr)
df %>% 
 group_by(id) %>%
 summarise(status = gsub("(NA, ){1,}(?=a)", "", toString(status), 
       perl = TRUE)) %>% 
 separate_rows(status, convert = TRUE) 
# A tibble: 13 x 2
#      id status
#   <int> <chr> 
# 1     1 a     
# 2     1 c     
# 3     1 a     
# 4     2 NA    
# 5     2 b     
# 6     2 c     
# 7     2 c     
# 8     3 a     
# 9     3 c     
#10     4 NA    
#11     4 NA    
#12     4 b     
#13     4 b     
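To see what the regular expression does on its own, here is a minimal sketch in base R for a single group. The input vectors are taken from groups 3 and 2 of the example data; the variable names s and cleaned are just illustrative.

```r
# Collapse one group's status values into a single string; NA becomes the text "NA".
s <- toString(c(NA, NA, "a", "c"))   # "NA, NA, a, c"

# Drop one or more "NA, " tokens only when the next character is "a".
# (?=a) is a lookahead, so the "a" itself is not consumed; it requires perl = TRUE.
cleaned <- gsub("(NA, ){1,}(?=a)", "", s, perl = TRUE)
cleaned                              # "a, c"

# A group whose first non-NA value is "b" is left untouched.
untouched <- gsub("(NA, ){1,}(?=a)", "", toString(c(NA, "b", "c", "c")), perl = TRUE)
untouched                            # "NA, b, c, c"
```

Because this works on the collapsed string, it matches any run of NAs directly before a value starting with "a", not only a leading run, so it fits data shaped like the example rather than arbitrary input.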

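Or use data.table with the same approach (note that strsplit returns the remaining NA entries as the string "NA", not as missing values)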

library(data.table)
out1 <- setDT(df)[, strsplit(gsub("(NA, ){1,}(?=a)", "", 
            toString(status), perl = TRUE), ", "), id]
setnames(out1, 'V1', "status")[]
#    id status
# 1:  1      a
# 2:  1      c
# 3:  1      a
# 4:  2     NA
# 5:  2      b
# 6:  2      c
# 7:  2      c
# 8:  3      a
# 9:  3      c
#10:  4     NA
#11:  4     NA
#12:  4      b
#13:  4      b

Update

For the updated dataset 'df2'

i1 <- setDT(df2)[, .I[seq(which(c(diff((status %in% "a") + 
              rleid(is.na(status))) > 1), FALSE))]  , id]$V1
df2[-i1]
#     id status otherVar
# 1:  1      a        b
# 2:  1      c        c
# 3:  1      a        d
# 4:  2     NA        e
# 5:  2      b        f
# 6:  2      c        g
# 7:  2      c        h
# 8:  3      a        k
# 9:  3      c        l
#10:  4     NA        m
#11:  4     NA        n
#12:  4      b        o
#13:  4      b        p

Using na.locf from zoo together with is.na; note that this assumes your data is ordered.

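library(zoo)
# fill each NA with the next non-NA value, then drop NA rows whose next value is "a"
df[!(na.locf(df$status, fromLast = TRUE) == 'a' & is.na(df$status)), ]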
   id status
2   1      a
3   1      c
4   1      a
5   2   <NA>
6   2      b
7   2      c
8   2      c
11  3      a
12  3      c
13  4   <NA>
14  4   <NA>
15  4      b
16  4      b

Here is a dplyr solution and a not-so-pretty base translation:

dplyr

library(dplyr)
df %>% group_by(id) %>%
  filter(status[!is.na(status)][1]!="a" | !is.na(status))

# # A tibble: 13 x 2
# # Groups:   id [4]
#       id status
#    <int>  <chr>
#  1     1      a
#  2     1      c
#  3     1      a
#  4     2   <NA>
#  5     2      b
#  6     2      c
#  7     2      c
#  8     3      a
#  9     3      c
# 10     4   <NA>
# 11     4   <NA>
# 12     4      b
# 13     4      b

base

do.call(rbind,
        lapply(split(df,df$id),
               function(x) x[x$status[!is.na(x$status)][1]!="a" | !is.na(x$status),]))

#      id status
# 1.2   1      a
# 1.3   1      c
# 1.4   1      a
# 2.5   2   <NA>
# 2.6   2      b
# 2.7   2      c
# 2.8   2      c
# 3.11  3      a
# 3.12  3      c
# 4.13  4   <NA>
# 4.14  4   <NA>
# 4.15  4      b
# 4.16  4      b

Note

This will fail if not all NAs are leading, because it removes every NA from any group whose first non-NA value is "a".
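One way around that caveat is to flag only the NAs that come before the first non-NA value of each group. The following is a sketch in base R, not part of the answers above; drop_lead is a hypothetical helper name, and df2 is rebuilt here so the block runs on its own.

```r
df2 <- data.frame(
  id = rep(1:4, each = 4),
  status = c(NA, "a", "c", "a",
             NA, "b", "c", "c",
             NA, NA, "a", "c",
             NA, NA, "b", "b"),
  otherVar = letters[1:16],
  stringsAsFactors = FALSE)

# drop_lead (hypothetical helper): TRUE for rows that are leading NAs
# in a group whose first non-NA status equals `target`.
drop_lead <- function(status, target = "a") {
  leading <- cumsum(!is.na(status)) == 0   # TRUE until the first non-NA value
  first   <- status[!is.na(status)][1]     # NA if the group is entirely NA
  leading & !is.na(first) & first == target
}

# Apply the helper per id and put the flags back in the original row order.
drop <- unsplit(lapply(split(df2$status, df2$id), drop_lead), df2$id)
result <- df2[!drop, ]
result
```

Since the rows are removed with a logical index on the whole data frame, other columns such as otherVar are kept, and an NA that appears after the first non-NA value of a group is left in place.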