Conditionally remove duplicated rows by group
I ran a survey, and my data looks like this:
dt<-structure(list(ID = c("183577", "183577", "183907", "183907",
"184188", "184188", "184188", "184188", "184188", "185167", "185167",
"185167"), Question = c("7.6", "7.6", "7.7", "7.7", "1.1", "1.1",
"1.2", "1.2", "10.1", "7.7", "7.7", "7.7"), Answer = c("PARTIALLY",
"YES", "", "", "", "PARTIALLY", "YES", "", "", "", "YES", "PARTIALLY"
), Control = c(-2.93736019374946, 1.01801705406142, 0.0598708062395571,
-0.456635810693228, 3.04311151438148, 0.641092485370467, 0.518503165474265,
0.284680056109131, 1.98580865602238, -0.547063974950295, -0.507700003072695,
-0.194028453167317)), row.names = c(NA, -12L), class = c("data.table",
"data.frame"), index = integer(0))
dt
        ID Question    Answer     Control
 1: 183577      7.6 PARTIALLY -2.93736019
 2: 183577      7.6       YES  1.01801705
 3: 183907      7.7            0.05987081
 4: 183907      7.7           -0.45663581
 5: 184188      1.1            3.04311151
 6: 184188      1.1 PARTIALLY  0.64109249
 7: 184188      1.2       YES  0.51850317
 8: 184188      1.2            0.28468006
 9: 184188     10.1            1.98580866
10: 185167      7.7           -0.54706397
11: 185167      7.7       YES -0.50770000
12: 185167      7.7 PARTIALLY -0.19402845
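In the Answer variable there are several missing values, coded as the empty string "". For some of these questions, however, I have another row for the same individual and question in which Answer takes a non-missing value (e.g., PARTIALLY, YES, NO).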
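I want to delete every duplicated row whose answer is missing (Answer == "") when another row for the same ID and Question has a real value. So, for example, delete row 5 and keep row 6. However, when none of the answers has a non-missing value (e.g., rows 3 and 4, and row 9), I want to keep the observations with missing values.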
Does anyone know how I can do this?
The final dataset should look like this:
       ID Question    Answer     Control
1: 183577      7.6 PARTIALLY -2.93736019
2: 183577      7.6       YES  1.01801705
3: 183907      7.7            0.05987081
4: 183907      7.7           -0.45663581
5: 184188      1.1 PARTIALLY  0.64109249
6: 184188      1.2       YES  0.51850317
7: 184188     10.1            1.98580866
8: 185167      7.7       YES -0.50770000
9: 185167      7.7 PARTIALLY -0.19402845
Note the special case of rows 1 and 2: there I have two different non-missing values. As the final dataset shows, I want to keep both observations in that case.
Thanks
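Try this. Within each ID/Question group, we keep a row if either of two conditions holds: 1) all answers in the group are empty, or 2) the row's own answer is not empty.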
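library(tidyverse)

# all(Answer == "") is a single TRUE/FALSE per group and is recycled
# against the row-wise test Answer != "".
dt %>%
  group_by(ID, Question) %>%
  filter(all(Answer == "") | Answer != "")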
# A tibble: 9 x 4
# Groups:   ID, Question [6]
  ID     Question Answer      Control
  <chr>  <chr>    <chr>         <dbl>
1 183577 7.6      "PARTIALLY" -2.94
2 183577 7.6      "YES"        1.02
3 183907 7.7      ""           0.0599
4 183907 7.7      ""          -0.457
5 184188 1.1      "PARTIALLY"  0.641
6 184188 1.2      "YES"        0.519
7 184188 10.1     ""           1.99
8 185167 7.7      "YES"       -0.508
9 185167 7.7      "PARTIALLY" -0.194
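Since dt is already a data.table, the same two-condition filter can probably be written without dplyr as well. The following is only a sketch under a few assumptions: the data.table package is installed, the data is rebuilt here with Control rounded to three decimals for brevity, and res is just a name chosen for this example.

```r
library(data.table)

# Rebuild the question's data (assumption: Control rounded to 3 decimals).
dt <- data.table(
  ID       = c("183577", "183577", "183907", "183907", "184188", "184188",
               "184188", "184188", "184188", "185167", "185167", "185167"),
  Question = c("7.6", "7.6", "7.7", "7.7", "1.1", "1.1",
               "1.2", "1.2", "10.1", "7.7", "7.7", "7.7"),
  Answer   = c("PARTIALLY", "YES", "", "", "", "PARTIALLY",
               "YES", "", "", "", "YES", "PARTIALLY"),
  Control  = c(-2.937, 1.018, 0.060, -0.457, 3.043, 0.641,
               0.519, 0.285, 1.986, -0.547, -0.508, -0.194)
)

# For each (ID, Question) group, subset the group's rows (.SD):
# keep all rows when every answer is empty, otherwise keep only the
# rows whose own answer is non-empty.
res <- dt[, .SD[all(Answer == "") | Answer != ""], by = .(ID, Question)]
res
```

The grouping columns come first in the result and groups appear in order of first occurrence, so the rows should line up with the expected dataset in the question.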