Deleting rows if an ID does not have at least 2 recordings, for multiple variables

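Question 1) Suppose I have longitudinal data for 4 participants over 4 years, i.e. at years 0, 1, 3 and 4. My goal is to:

  1. Check whether the outcome variable (n1) was recorded at at least 2 time points (any 2) for each participant.
  2. If a participant has only one recording, delete that participant's rows; otherwise keep them.
  3. Repeat 1) and 2) for multiple outcome variables (e.g. m1).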

I have this data:

ID  visit   n1  m1
1   0   5.6 0
1   1   1.5 NA
1   3   0.5 NA
1   4   NA  NA
2   0   6   1
2   1   NA  0
2   3   NA  0
2   4   NA  0
3   0   3.4 0
3   1   2.4 0
3   3   2.5 0
3   4   1   1
4   0   NA  NA
4   1   NA  NA
4   3   NA  NA
4   4   3.3 0

This is what I want:

data1
ID  visit   n1
1   0   5.6
1   1   1.5
1   3   0.5
1   4   NA
3   0   3.4
3   1   2.4
3   3   2.5
3   4   1

data2        
ID  visit   m1
2   0   1
2   1   0
2   3   0
2   4   0
3   0   0
3   1   0
3   3   0
3   4   1

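Alternatively, the data could take this form, where we create a new variable n12 (0 = fewer than 2 values present in n1, 1 = 2 or more values present in n1) and similarly m12. Later I can delete rows based on the values of these new variables n12 and m12.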

ID  visit   n1  m1  n12 m12
1    0   5.6 0  1   0
1    1   1.5 NA 1   0
1    3   0.5 NA 1   0
1    4   NA  NA 1   0
2    0   6   1  0   1
2    1   NA  0  0   1
2    3   NA  0  0   1
2    4   NA  0  0   1
3    0   3.4 0  1   1
3    1   2.4 0  1   1
3    3   2.5 0  1   1
3    4   1   1  1   1
4    0   NA  NA 0   0
4    1   NA  NA 0   0
4    3   NA  NA 0   0
4    4   3.3 0  0   0

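I tried the following, but this code leaves "0" observations in mydata, because it drops all of an ID's rows as soon as a single NA is found in any of them: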

mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]

library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]

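I also tried converting from long to wide format, which did not work, I guess because in my case the values are spread over multiple variables: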

library(dplyr)    
mydata <- mydata %>%
tidyr::spread(key=time, value=value) %>% # reformat to wide
na.omit() %>% # delete cases with missingness on any variable (i.e. any time point)
tidyr::gather(key="time", value="value", -ID) # put it back in long format

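New question 2: What if I only want the rows for participants who have n1 recorded at visit 0 and at at least one other visit (1/3/4)? How should I code that to get data like this: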

ID  visit   n1  
1   0   5.6 
1   1   1.5 
1   3   0.5 
1   4   NA  
3   0   3.4 
3   1   2.4 
3   3   2.5 
3   4   1

Please suggest R syntax or an approach to achieve this. Thanks!

dat <- structure(list(
  ID = c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4"),
  visit = c("0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3","4"),
  n1 = c("5.6", "1.5", "0.5", NA, "6", NA, NA, NA, "3.4","2.4", "2.5", "1", NA, NA, NA, "3.3"),
  m1 = c("0", NA, NA, NA, "1", "0", "0", "0", "0", "0", "0", "1", NA, NA, NA, "0")),
  row.names = 2:17, class = "data.frame")

library(dplyr)

dat %>%
  group_by(ID) %>%
  filter(sum(!is.na(n1)) >= 2) %>%
  assign("data1", ., inherits = TRUE)
data1
#> # A tibble: 8 x 4
#> # Groups:   ID [2]
#>   ID    visit n1    m1   
#>   <chr> <chr> <chr> <chr>
#> 1 1     0     5.6   0    
#> 2 1     1     1.5   <NA> 
#> 3 1     3     0.5   <NA> 
#> 4 1     4     <NA>  <NA> 
#> 5 3     0     3.4   0    
#> 6 3     1     2.4   0    
#> 7 3     3     2.5   0    
#> 8 3     4     1     1

dat %>%
  group_by(ID) %>%
  filter(sum(!is.na(m1)) >= 2) %>%
  assign("data2", ., inherits = TRUE)
data2
#> # A tibble: 8 x 4
#> # Groups:   ID [2]
#>   ID    visit n1    m1   
#>   <chr> <chr> <chr> <chr>
#> 1 2     0     6     1    
#> 2 2     1     <NA>  0    
#> 3 2     3     <NA>  0    
#> 4 2     4     <NA>  0    
#> 5 3     0     3.4   0    
#> 6 3     1     2.4   0    
#> 7 3     3     2.5   0    
#> 8 3     4     1     1

dat %>%
  group_by(ID) %>%
  mutate(n12 = ifelse(sum(!is.na(n1)) >= 2, 1, 0)) %>%
  mutate(m12 = ifelse(sum(!is.na(m1)) >= 2, 1, 0))
#> # A tibble: 16 x 6
#> # Groups:   ID [4]
#>    ID    visit n1    m1      n12   m12
#>    <chr> <chr> <chr> <chr> <dbl> <dbl>
#>  1 1     0     5.6   0         1     0
#>  2 1     1     1.5   <NA>      1     0
#>  3 1     3     0.5   <NA>      1     0
#>  4 1     4     <NA>  <NA>      1     0
#>  5 2     0     6     1         0     1
#>  6 2     1     <NA>  0         0     1
#>  7 2     3     <NA>  0         0     1
#>  8 2     4     <NA>  0         0     1
#>  9 3     0     3.4   0         1     1
#> 10 3     1     2.4   0         1     1
#> 11 3     3     2.5   0         1     1
#> 12 3     4     1     1         1     1
#> 13 4     0     <NA>  <NA>      0     0
#> 14 4     1     <NA>  <NA>      0     0
#> 15 4     3     <NA>  <NA>      0     0
#> 16 4     4     3.3   0         0     0
Created on 2021-12-16 by the reprex package (v2.0.1)
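
For question 2 (n1 recorded at visit 0 plus at least one other visit), the same grouped `filter()` idea can be extended with two `any()` conditions. This is only a minimal sketch: it assumes `visit` is stored as character, as in `dat`, and the name `q2` is purely illustrative.

```r
library(dplyr)

# Same n1 data as in `dat`, rebuilt here so the sketch is self-contained
dat <- data.frame(
  ID    = rep(c("1", "2", "3", "4"), each = 4),
  visit = rep(c("0", "1", "3", "4"), times = 4),
  n1    = c("5.6", "1.5", "0.5", NA, "6", NA, NA, NA,
            "3.4", "2.4", "2.5", "1", NA, NA, NA, "3.3"),
  stringsAsFactors = FALSE
)

q2 <- dat %>%
  group_by(ID) %>%
  filter(
    any(visit == "0" & !is.na(n1)),  # n1 recorded at visit 0
    any(visit != "0" & !is.na(n1))   # n1 recorded at some other visit
  ) %>%
  ungroup()

q2
```

With this data, only IDs 1 and 3 satisfy both conditions: ID 2 has no follow-up value and ID 4 has no value at visit 0.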

Demonstration of the behavior:

# Let's look at one ID
# group_by() is essentially doing the same thing
# i.e., summarizing by group
id1 <- dat[dat$ID==1,]
id1
#>   ID visit   n1   m1
#> 2  1     0  5.6    0
#> 3  1     1  1.5 <NA>
#> 4  1     3  0.5 <NA>
#> 5  1     4 <NA> <NA>

# Test for NA in variable n1
is.na(id1$n1)
#> [1] FALSE FALSE FALSE  TRUE

# Identify values that aren't NA
!is.na(id1$n1)
#> [1]  TRUE  TRUE  TRUE FALSE

# Count values that aren't NA
sum(!is.na(id1$n1))
#> [1] 3

# For the first approach:
#
# filter() keeps the rows that meet the criteria
# In this case, we are keeping IDs with at least 2 non-NAs in the focal variable (n1)
#
# Create an object in the environment based on this dplyr chain
# named "data1";
# using data from the dplyr chain (".");
# inherits=TRUE makes it available outside of the chain
# (this line only runs inside the chain, where "." is defined)
# assign(x="data1", value=., inherits=TRUE)

# For the second approach:
#
# If the number of non-NAs is at least 2 (per your criteria),
# return a 1, otherwise 0, in a new column
# A result of 1 here indicates this ID has two or more values that are non-NA
ifelse(test=sum(!is.na(id1$n1)) >= 2, yes=1, no=0)
#> [1] 1
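
To repeat the check across several outcome variables without copying the chain for each one, the steps can be wrapped in a small function. The sketch below is one possible way to do it, not the only one: `keep_recorded()`, `outcomes`, `subsets` and `flagged` are hypothetical names introduced for illustration, and `across()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Same data as `dat`, rebuilt here so the sketch is self-contained
dat <- data.frame(
  ID    = rep(c("1", "2", "3", "4"), each = 4),
  visit = rep(c("0", "1", "3", "4"), times = 4),
  n1    = c("5.6", "1.5", "0.5", NA, "6", NA, NA, NA,
            "3.4", "2.4", "2.5", "1", NA, NA, NA, "3.3"),
  m1    = c("0", NA, NA, NA, "1", "0", "0", "0",
            "0", "0", "0", "1", NA, NA, NA, "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep IDs with at least `min_n` non-NA values of `var`
keep_recorded <- function(data, var, min_n = 2) {
  data %>%
    group_by(ID) %>%
    filter(sum(!is.na(.data[[var]])) >= min_n) %>%
    ungroup() %>%
    select(ID, visit, all_of(var))
}

outcomes <- c("n1", "m1")

# One filtered data frame per outcome, stored in a named list
subsets <- lapply(setNames(outcomes, outcomes),
                  function(v) keep_recorded(dat, v))

# Alternative: add the 0/1 flag columns n12 and m12 in one step
flagged <- dat %>%
  group_by(ID) %>%
  mutate(across(all_of(outcomes),
                ~ as.integer(sum(!is.na(.x)) >= 2),
                .names = "{.col}2")) %>%
  ungroup()
```

`subsets$n1` and `subsets$m1` then correspond to data1 and data2, and adding another outcome only requires extending `outcomes`.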