如果行没有至少 2 个多变量记录,则删除/行
Deleting/ rows if rows do not have atleast 2 recordings for multiple variable
问题 1)
假设我有 4 名参与者超过 4 年的纵向数据,即第 0、1、3、4 年。
我的目标是
- 检查数据是否记录了至少任意 2 个时间点的结果变量 (n1)。
- 如果只有一个录音则删除;否则保留它。
- 对多个结果变量 (m1) 重复 1) 和 2)
我有数据
ID visit n1 m1
1 0 5.6 0
1 1 1.5 NA
1 3 0.5 NA
1 4 NA NA
2 0 6 1
2 1 NA 0
2 3 NA 0
2 4 NA 0
3 0 3.4 0
3 1 2.4 0
3 3 2.5 0
3 4 1 1
4 0 NA NA
4 1 NA NA
4 3 NA NA
4 4 3.3 0
这就是我想要的
data 1
ID visit n1
1 0 5.6
1 1 1.5
1 3 0.5
1 4 NA
3 0 3.4
3 1 2.4
3 3 2.5
3 4 1
data2
ID visit m1
2 0 1
2 1 0
2 3 0
2 4 0
3 0 0
3 1 0
3 3 0
3 4 1
或者我们创建新变量 n12 的这种形式(0= <2 个值存在于 n1 与 1= >=2 个值存在于 n1 中)和类似的 m12。稍后我可以根据这些新变量 n12 和 m12 的值删除行。
ID visit n1 m1 n12 m12
1 0 5.6 0 1 0
1 1 1.5 NA 1 0
1 3 0.5 NA 1 0
1 4 NA NA 1 0
2 0 6 1 0 1
2 1 NA 0 0 1
2 3 NA 0 0 1
2 4 NA 0 0 1
3 0 3.4 0 1 1
3 1 2.4 0 1 1
3 3 2.5 0 1 1
3 4 1 1 1 1
4 0 NA NA 0 0
4 1 NA NA 0 0
4 3 NA NA 0 0
4 4 3.3 0 0 0
我试过了
但是以下代码在 mydata 中给出了“0”观察结果,因为即使在行中找到单个 NA,它也会删除行
mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]
library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]
我也尝试了从长到宽的格式,这没有用,因为我猜我的情况下的值是多个变量
library(dplyr)
mydata <- mydata %>%
tidyr::spread(key=time, value=value) %>% # reformat to wide
na.omit() %>% # delete cases with missingness on any variable (i.e. any time point)
tidyr::gather(key="time", value="value", -ID) # put it back in long format
新问题 2:如果我只希望 n1 的访问值为 0 的行和 n1 的至少一次其他访问 (1/3/4) 记录,我应该如何编码?获取这样的数据:
ID visit n1
1 0 5.6
1 1 1.5
1 3 0.5
1 4 NA
3 0 3.4
3 1 2.4
3 3 2.5
3 4 1
请建议 R 语法或方法来实现目标
谢谢!
dat <- structure(list(
ID = c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4"),
visit = c("0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3","4"),
n1 = c("5.6", "1.5", "0.5", NA, "6", NA, NA, NA, "3.4","2.4", "2.5", "1", NA, NA, NA, "3.3"),
m1 = c("0", NA, NA, NA, "1", "0", "0", "0", "0", "0", "0", "1", NA, NA, NA, "0")),
row.names = 2:17, class = "data.frame")
library(dplyr)
dat %>%
group_by(ID) %>%
filter(sum(!is.na(n1)) >= 2) %>%
assign("data1", ., inherits = TRUE)
data1
#> # A tibble: 8 x 4
#> # Groups: ID [2]
#> ID visit n1 m1
#> <chr> <chr> <chr> <chr>
#> 1 1 0 5.6 0
#> 2 1 1 1.5 <NA>
#> 3 1 3 0.5 <NA>
#> 4 1 4 <NA> <NA>
#> 5 3 0 3.4 0
#> 6 3 1 2.4 0
#> 7 3 3 2.5 0
#> 8 3 4 1 1
dat %>%
group_by(ID) %>%
filter(sum(!is.na(m1)) >= 2) %>%
assign("data2", ., inherits = TRUE)
data2
#> # A tibble: 8 x 4
#> # Groups: ID [2]
#> ID visit n1 m1
#> <chr> <chr> <chr> <chr>
#> 1 2 0 6 1
#> 2 2 1 <NA> 0
#> 3 2 3 <NA> 0
#> 4 2 4 <NA> 0
#> 5 3 0 3.4 0
#> 6 3 1 2.4 0
#> 7 3 3 2.5 0
#> 8 3 4 1 1
dat %>%
group_by(ID) %>%
mutate(n12 = ifelse(sum(!is.na(n1)) >= 2, 1, 0)) %>%
mutate(m12 = ifelse(sum(!is.na(m1)) >= 2, 1, 0))
#> # A tibble: 16 x 6
#> # Groups: ID [4]
#> ID visit n1 m1 n12 m12
#> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 1 0 5.6 0 1 0
#> 2 1 1 1.5 <NA> 1 0
#> 3 1 3 0.5 <NA> 1 0
#> 4 1 4 <NA> <NA> 1 0
#> 5 2 0 6 1 0 1
#> 6 2 1 <NA> 0 0 1
#> 7 2 3 <NA> 0 0 1
#> 8 2 4 <NA> 0 0 1
#> 9 3 0 3.4 0 1 1
#> 10 3 1 2.4 0 1 1
#> 11 3 3 2.5 0 1 1
#> 12 3 4 1 1 1 1
#> 13 4 0 <NA> <NA> 0 0
#> 14 4 1 <NA> <NA> 0 0
#> 15 4 3 <NA> <NA> 0 0
#> 16 4 4 3.3 0 0 0
Created on 2021-12-16 by the reprex package (v2.0.1)
行为示范:
# Let's look at one ID
# group_by() is essentially doing the same thing
# i.e., summarizing by group
id1 <- dat[dat$ID==1,]
id1
#> ID visit n1 m1
#> 2 1 0 5.6 0
#> 3 1 1 1.5 <NA>
#> 4 1 3 0.5 <NA>
#> 5 1 4 <NA> <NA>
# Test for NA in variable n1
is.na(id1$n1)
#> [1] FALSE FALSE FALSE TRUE
# Identify values that aren't NA
!is.na(id1$n1)
#> [1] TRUE TRUE TRUE FALSE
# Count values that aren't NA
sum(!is.na(id1$n1))
#> [1] 3
# For the first approach:
#
# Filter() keeps values that meet the criteria
# In this case, we are keeping IDs with more than 2 non-NAs in the focal variable (n1)
#
# Create object in the environment based on this dplyr chain
# named "data2";
# using data from dplyr chain ".";
# inherits=TRUE makes it available outside of the chain
assign(x="data1", value=., inherits=TRUE)
# For the second approach:
#
# If the number of non-NAs is greater than 2 (per your criteria),
# return a 1, otherwise 0, in a new column
# A result of 1 here indicates this ID has two or more values that are non-NA
ifelse(test=sum(!is.na(id1$n1)) >= 2, yes=1, no=0)
#> [1] 1
问题 1) 假设我有 4 名参与者超过 4 年的纵向数据,即第 0、1、3、4 年。 我的目标是
- 检查数据是否记录了至少任意 2 个时间点的结果变量 (n1)。
- 如果只有一个录音则删除;否则保留它。
- 对多个结果变量 (m1) 重复 1) 和 2)
我有数据
ID visit n1 m1
1 0 5.6 0
1 1 1.5 NA
1 3 0.5 NA
1 4 NA NA
2 0 6 1
2 1 NA 0
2 3 NA 0
2 4 NA 0
3 0 3.4 0
3 1 2.4 0
3 3 2.5 0
3 4 1 1
4 0 NA NA
4 1 NA NA
4 3 NA NA
4 4 3.3 0
这就是我想要的
data 1
ID visit n1
1 0 5.6
1 1 1.5
1 3 0.5
1 4 NA
3 0 3.4
3 1 2.4
3 3 2.5
3 4 1
data2
ID visit m1
2 0 1
2 1 0
2 3 0
2 4 0
3 0 0
3 1 0
3 3 0
3 4 1
或者我们创建新变量 n12 的这种形式(0= <2 个值存在于 n1 与 1= >=2 个值存在于 n1 中)和类似的 m12。稍后我可以根据这些新变量 n12 和 m12 的值删除行。
ID visit n1 m1 n12 m12
1 0 5.6 0 1 0
1 1 1.5 NA 1 0
1 3 0.5 NA 1 0
1 4 NA NA 1 0
2 0 6 1 0 1
2 1 NA 0 0 1
2 3 NA 0 0 1
2 4 NA 0 0 1
3 0 3.4 0 1 1
3 1 2.4 0 1 1
3 3 2.5 0 1 1
3 4 1 1 1 1
4 0 NA NA 0 0
4 1 NA NA 0 0
4 3 NA NA 0 0
4 4 3.3 0 0 0
我试过了
mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]
library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]
我也尝试了从长到宽的格式,这没有用,因为我猜我的情况下的值是多个变量
library(dplyr)
mydata <- mydata %>%
tidyr::spread(key=time, value=value) %>% # reformat to wide
na.omit() %>% # delete cases with missingness on any variable (i.e. any time point)
tidyr::gather(key="time", value="value", -ID) # put it back in long format
新问题 2:如果我只希望 n1 的访问值为 0 的行和 n1 的至少一次其他访问 (1/3/4) 记录,我应该如何编码?获取这样的数据:
ID visit n1
1 0 5.6
1 1 1.5
1 3 0.5
1 4 NA
3 0 3.4
3 1 2.4
3 3 2.5
3 4 1
请建议 R 语法或方法来实现目标 谢谢!
dat <- structure(list(
ID = c("1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4"),
visit = c("0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3", "4", "0", "1", "3","4"),
n1 = c("5.6", "1.5", "0.5", NA, "6", NA, NA, NA, "3.4","2.4", "2.5", "1", NA, NA, NA, "3.3"),
m1 = c("0", NA, NA, NA, "1", "0", "0", "0", "0", "0", "0", "1", NA, NA, NA, "0")),
row.names = 2:17, class = "data.frame")
library(dplyr)
dat %>%
group_by(ID) %>%
filter(sum(!is.na(n1)) >= 2) %>%
assign("data1", ., inherits = TRUE)
data1
#> # A tibble: 8 x 4
#> # Groups: ID [2]
#> ID visit n1 m1
#> <chr> <chr> <chr> <chr>
#> 1 1 0 5.6 0
#> 2 1 1 1.5 <NA>
#> 3 1 3 0.5 <NA>
#> 4 1 4 <NA> <NA>
#> 5 3 0 3.4 0
#> 6 3 1 2.4 0
#> 7 3 3 2.5 0
#> 8 3 4 1 1
dat %>%
group_by(ID) %>%
filter(sum(!is.na(m1)) >= 2) %>%
assign("data2", ., inherits = TRUE)
data2
#> # A tibble: 8 x 4
#> # Groups: ID [2]
#> ID visit n1 m1
#> <chr> <chr> <chr> <chr>
#> 1 2 0 6 1
#> 2 2 1 <NA> 0
#> 3 2 3 <NA> 0
#> 4 2 4 <NA> 0
#> 5 3 0 3.4 0
#> 6 3 1 2.4 0
#> 7 3 3 2.5 0
#> 8 3 4 1 1
dat %>%
group_by(ID) %>%
mutate(n12 = ifelse(sum(!is.na(n1)) >= 2, 1, 0)) %>%
mutate(m12 = ifelse(sum(!is.na(m1)) >= 2, 1, 0))
#> # A tibble: 16 x 6
#> # Groups: ID [4]
#> ID visit n1 m1 n12 m12
#> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 1 0 5.6 0 1 0
#> 2 1 1 1.5 <NA> 1 0
#> 3 1 3 0.5 <NA> 1 0
#> 4 1 4 <NA> <NA> 1 0
#> 5 2 0 6 1 0 1
#> 6 2 1 <NA> 0 0 1
#> 7 2 3 <NA> 0 0 1
#> 8 2 4 <NA> 0 0 1
#> 9 3 0 3.4 0 1 1
#> 10 3 1 2.4 0 1 1
#> 11 3 3 2.5 0 1 1
#> 12 3 4 1 1 1 1
#> 13 4 0 <NA> <NA> 0 0
#> 14 4 1 <NA> <NA> 0 0
#> 15 4 3 <NA> <NA> 0 0
#> 16 4 4 3.3 0 0 0
Created on 2021-12-16 by the reprex package (v2.0.1)
行为示范:
# Let's look at one ID
# group_by() is essentially doing the same thing
# i.e., summarizing by group
id1 <- dat[dat$ID==1,]
id1
#> ID visit n1 m1
#> 2 1 0 5.6 0
#> 3 1 1 1.5 <NA>
#> 4 1 3 0.5 <NA>
#> 5 1 4 <NA> <NA>
# Test for NA in variable n1
is.na(id1$n1)
#> [1] FALSE FALSE FALSE TRUE
# Identify values that aren't NA
!is.na(id1$n1)
#> [1] TRUE TRUE TRUE FALSE
# Count values that aren't NA
sum(!is.na(id1$n1))
#> [1] 3
# For the first approach:
#
# Filter() keeps values that meet the criteria
# In this case, we are keeping IDs with more than 2 non-NAs in the focal variable (n1)
#
# Create object in the environment based on this dplyr chain
# named "data2";
# using data from dplyr chain ".";
# inherits=TRUE makes it available outside of the chain
assign(x="data1", value=., inherits=TRUE)
# For the second approach:
#
# If the number of non-NAs is greater than 2 (per your criteria),
# return a 1, otherwise 0, in a new column
# A result of 1 here indicates this ID has two or more values that are non-NA
ifelse(test=sum(!is.na(id1$n1)) >= 2, yes=1, no=0)
#> [1] 1