Filtering IDs based on a specified difference in values
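I am trying to filter IDs based on a specified condition. For example, I want to filter IDs whose questionnaire scores show a particular difference from pre-treatment to post-treatment. The idea is to pick out the IDs whose scores improved, stayed the same, or worsened. Here is a mock dataset of what I am trying to achieve: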
ID<-c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
Condition<-c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
Score<-c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11)
df<-cbind(ID,Condition,Score)
df<-as.data.frame(df)
df$Condition<-as.factor(df$Condition)
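The main problem here is that some IDs appear twice in the data, each time with both a Pre and a Post score.
I tried a dplyr solution to select the appropriate columns from the main data frame, and then used the tidyverse spread function to convert to wide format, because from there I could easily work out the difference. However, I ran into a particular problem: it does not work, because there are duplicate instances where an ID appears again in the data (e.g., IDs aaa, bbb, and ccc).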
df2<-df%>%
group_by(ID)%>%
spread(Condition, Score)
This gives me the following error message:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 12 rows:
* 10, 22
* 11, 23
* 12, 24
* 1, 19
* 2, 20
* 3, 21
Do you need to create unique ID with tibble::rowid_to_column()?
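The row numbers in the message are the rows that share the same ID/Condition pair. As a rough sketch in base R (the names key and dup_rows are made up for illustration), the clashing rows can be listed like this:

```r
# Same ID and Condition vectors as in the mock dataset above.
ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c(rep("Pre", 9), rep("Post", 9), rep("Pre", 3), rep("Post", 3))

# One key per row; rows with the same key cannot be told apart when spreading.
key <- paste(ID, Condition)

# Row numbers grouped by key, keeping only keys that occur more than once.
dup_rows <- split(seq_along(key), key)
dup_rows <- dup_rows[lengths(dup_rows) > 1]
dup_rows
```

Each element of dup_rows holds the row numbers of one clashing pair, which should line up with the pairs listed in the error.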
Ideally, the result I want would look like this:
#improved
ID Pre Post Difference
aaa 23 17 -6
bbb 20 17 -3
ggg 20 14 -6
hhh 19 15 -4
iii 18 10 -8
aaa 23 20 -3
bbb 23 18 -5
ccc 21 11 -10
#no improvement
ID Pre Post Difference
ccc 19 19 0
eee 22 22 0
fff 22 22 0
#worsened
ID Pre Post Difference
ddd 15 20 +5
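Or something along those lines, as long as it lets me include the duplicate IDs. Ideally, I would also like to be able to filter further based on the magnitude of the difference, for example to subset/filter the IDs whose score improved by more than 5 points or worsened by more than 5 points. Bear in mind that my actual dataset will be larger than the example I have just made up and provided. As always, any help would be much appreciated.
Thank you in advance :)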
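One option is to first convert 'Score' from factor to numeric, group by 'ID' and 'Condition', create a sequence column ('rn'), spread to 'wide' format, take the difference between the 'Post' and 'Pre' scores, and split by the sign of the 'Difference' column to create a list of tibbles.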
library(tidyverse)
df %>%
mutate(Score = as.numeric(as.character(Score))) %>%
group_by(ID, Condition) %>%
mutate(rn = row_number()) %>%
spread(Condition, Score) %>%
mutate(Difference = Post -Pre) %>%
ungroup %>%
select(-rn) %>%
group_split(grp = sign(Difference), keep = FALSE)
#[[1]]
# A tibble: 8 x 4
# ID Post Pre Difference
# <fct> <dbl> <dbl> <dbl>
#1 aaa 17 23 -6
#2 aaa 20 23 -3
#3 bbb 17 20 -3
#4 bbb 18 23 -5
#5 ccc 11 21 -10
#6 ggg 14 20 -6
#7 hhh 15 19 -4
#8 iii 10 18 -8
#[[2]]
# A tibble: 3 x 4
# ID Post Pre Difference
# <fct> <dbl> <dbl> <dbl>
#1 ccc 19 19 0
#2 eee 22 22 0
#3 fff 22 22 0
#[[3]]
# A tibble: 1 x 4
# ID Post Pre Difference
# <fct> <dbl> <dbl> <dbl>
#1 ddd 20 15 5
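Note: using as.data.frame(cbind(...)) is not recommended, because cbind converts to a matrix, and a matrix can hold only a single class, i.e. if there is one character column, all the other columns are converted to character. Wrapping that in as.data.frame then turns them into factors (the default option was stringsAsFactors = TRUE before R 4.0.0). Create the data frame directly instead: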
df <- data.frame(...) #directly create
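To make the coercion concrete, here is a minimal base R sketch using a shortened version of the question's vectors (m and df_ok are illustrative names, not from the original post):

```r
# Three rows are enough to show the effect.
ID <- c("aaa", "bbb", "ccc")
Condition <- c("Pre", "Pre", "Post")
Score <- c(23, 20, 19)

# cbind() on plain vectors builds a matrix; mixing character and numeric
# vectors coerces every cell to character.
m <- cbind(ID, Condition, Score)
class(m[, "Score"])   # "character"

# Building the data frame directly keeps each column's own type.
df_ok <- data.frame(ID, Condition, Score)
class(df_ok$Score)    # "numeric"
```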
Another tidyverse possibility is:
df %>%
mutate_if(is.factor, as.character) %>%
mutate(Score = as.numeric(Score)) %>%
group_by(Condition) %>%
mutate(ID = make.unique(ID)) %>%
group_by(ID) %>%
mutate(Difference = Score - lag(Score)) %>%
spread(Condition, Score) %>%
summarise_all(max, na.rm = TRUE) %>%
arrange(Difference)
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 ccc.1 -10 11 21
2 iii -8 10 18
3 aaa -6 17 23
4 ggg -6 14 20
5 bbb.1 -5 18 23
6 hhh -4 15 19
7 aaa.1 -3 20 23
8 bbb -3 17 20
9 ccc 0 19 19
10 eee 0 22 22
11 fff 0 22 22
12 ddd 5 20 15
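Here, it first creates unique IDs. Second, it computes the difference. Finally, it converts the data to wide format and arranges the rows by the difference.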
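For readers unfamiliar with make.unique, a small base R sketch of what it does to repeated values (ids and unique_ids are illustrative names):

```r
# Repeated values get a numeric suffix from their second occurrence onwards.
ids <- c("aaa", "bbb", "aaa", "bbb", "ccc")
unique_ids <- make.unique(ids)
unique_ids
# "aaa" "bbb" "aaa.1" "bbb.1" "ccc"
```

Because the pipeline calls it within each Condition group, the second Pre row and the second Post row of a repeated ID receive the same suffix and so still pair up. This relies on repeated IDs appearing in the same order in both conditions, which holds for the mock data but is an assumption for any other dataset.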
If for some reason you need to split it by the difference, you can add the last line of @akrun's code:
df %>%
mutate_if(is.factor, as.character) %>%
mutate(Score = as.numeric(Score)) %>%
group_by(Condition) %>%
mutate(ID = make.unique(ID)) %>%
group_by(ID) %>%
mutate(Difference = Score - lag(Score)) %>%
spread(Condition, Score) %>%
summarise_all(max, na.rm = TRUE) %>%
group_split(sign(Difference), keep = FALSE)
[[1]]
# A tibble: 8 x 4
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 aaa -6 17 23
2 aaa.1 -3 20 23
3 bbb -3 17 20
4 bbb.1 -5 18 23
5 ccc.1 -10 11 21
6 ggg -6 14 20
7 hhh -4 15 19
8 iii -8 10 18
[[2]]
# A tibble: 3 x 4
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 ccc 0 19 19
2 eee 0 22 22
3 fff 0 22 22
[[3]]
# A tibble: 1 x 4
ID Difference Post Pre
<chr> <dbl> <dbl> <dbl>
1 ddd 5 20 15
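The other answers address Score being turned into a factor by the cbind() call. Below are solutions in base R, data.table, and dplyr.
All of the solutions deal with the duplicate IDs by adding an extra Group variable. This allows spread to succeed.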
# Base R ------------------------------------------------------------------
df <- data.frame(ID, Condition, Score)
df$Group <- ave(seq_len(nrow(df)), df$Condition, FUN = seq_along)
df_wide <- reshape(df, timevar = 'Condition', idvar = c('ID', 'Group'), direction = 'wide')
df_wide$Difference <- df_wide$Score.Post - df_wide$Score.Pre
df_wide[order(df_wide$Difference),]
# data.table --------------------------------------------------------------
library(data.table)
dt <- data.table(ID, Condition, Score)
dt[, Group := seq_len(.N), by = Condition]
dt_wide <- dcast(dt, ID + Group ~ Condition, value.var = 'Score')
dt_wide[, Difference := Post - Pre]
dt_wide[order(Difference),]
# dplyr -------------------------------------------------------------------
library(tidyverse)
tib <- tibble(ID, Condition, Score)
tib%>%
group_by(Condition)%>%
mutate(Group = row_number())%>%
ungroup()%>%
spread(key = 'Condition', value = 'Score')%>%
mutate(Difference = Post - Pre)%>%
arrange(Difference)
For this very small dataset, base R is the fastest and data.table is the slowest.
Unit: milliseconds
expr min lq mean median uq max neval
base_r_way 2.7562 2.98075 3.103155 3.05140 3.12810 6.0653 100
data.table_way 6.6137 7.09705 8.216043 7.44250 8.01885 47.9138 100
dplyr_way 4.7334 5.15005 5.350857 5.25085 5.40395 9.5594 100
And the data:
ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
Score <- as.integer(c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11))
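The question also asks about filtering by the size of the difference. The following is only a sketch in base R, not part of any of the answers above: it reuses the Group idea, pairs rows with merge, and applies a cut-off. The names threshold, big_change, big_drop and big_rise are hypothetical, and pairing the n-th Pre row with the n-th Post row is an assumption carried over from the answers.

```r
ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c(rep("Pre", 9), rep("Post", 9), rep("Pre", 3), rep("Post", 3))
Score <- c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,
           23,23,21,20,18,11)

df <- data.frame(ID, Condition, Score)

# Counter within each Condition, so the n-th Pre row pairs with the n-th Post row.
df$Group <- ave(seq_len(nrow(df)), df$Condition, FUN = seq_along)

# Split by condition and join the halves back together on ID and Group.
pre  <- df[df$Condition == "Pre",  c("ID", "Group", "Score")]
post <- df[df$Condition == "Post", c("ID", "Group", "Score")]
wide <- merge(pre, post, by = c("ID", "Group"), suffixes = c(".Pre", ".Post"))
wide$Difference <- wide$Score.Post - wide$Score.Pre

threshold <- 5  # hypothetical cut-off; adjust as needed

big_change <- wide[abs(wide$Difference) > threshold, ]  # either direction
big_drop   <- wide[wide$Difference < -threshold, ]      # Post lower than Pre
big_rise   <- wide[wide$Difference >  threshold, ]      # Post higher than Pre
```

With the mock data and a strict "more than 5" rule, ddd (+5) and the second bbb pair (-5) fall just outside the cut-off; use >= and <= instead if the boundary should be included.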