Filtering IDs based on a specified difference in values

Question

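I am trying to filter IDs based on a specified condition. For example, I want to filter IDs whose questionnaire scores changed by a particular amount between pre-treatment and post-treatment. The idea is to find the IDs whose scores improved, stayed the same, or worsened. Here is a mock dataset that shows what I am trying to achieve: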

    ID<-c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
    Condition<-c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
    Score<-c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11)
    df<-cbind(ID,Condition,Score)
    df<-as.data.frame(df)
    df$Condition<-as.factor(df$Condition)

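The main problem is that some IDs appear in the data twice, each time with both a Pre and a Post score.

I tried a dplyr solution: select the appropriate columns from the main data frame, then use the tidyverse spread function to convert to wide format, because from there I could easily work out the differences. However, I ran into a specific problem. It does not work because there are duplicate instances where an ID appears in the data again (e.g. IDs aaa, bbb and ccc).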

     df2<-df%>%
     group_by(ID)%>%
     spread(Condition, Score)

This gives me the following error message:

    Error: Each row of output must be identified by a unique combination of keys.
    Keys are shared for 12 rows:
    * 10, 22
    * 11, 23
    * 12, 24
    * 1, 19
    * 2, 20
    * 3, 21
    Do you need to create unique ID with tibble::rowid_to_column()?
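The row count in that message can be reproduced without reshaping anything. The following base R sketch uses the vectors from the question (the names `keys` and `shared` are introduced here purely for illustration) and flags every row whose ID/Condition combination occurs more than once, which is the uniqueness that spread relies on:

```r
ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre",
               "Post","Post","Post","Post","Post","Post","Post","Post","Post",
               "Pre","Pre","Pre","Post","Post","Post")
Score <- c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,
           23,23,21,20,18,11)
df <- data.frame(ID, Condition, Score)

# spread() needs each ID/Condition pair to identify exactly one row
keys <- df[c("ID", "Condition")]

# Flag a row if the same pair occurs anywhere else in the data
shared <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

which(shared)  # row numbers whose key is not unique
sum(shared)    # how many rows are involved
```

On the mock data this flags 12 rows, which agrees with the count reported in the error message.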

Ideally, the result I want would look like this:

    #improved
    ID      Pre       Post     Difference
    aaa      23        17           -6
    bbb      20        17           -3
    ggg      20        14           -6
    hhh      19        15           -4
    iii      18        10           -8
    aaa      23        20           -3
    bbb      23        18           -5
    ccc      21        11           -10


    #no improvement
    ID      Pre       Post      Difference
    ccc      19         19          0
    eee      22         22          0
    fff      22         22          0


    #worsened
    ID      Pre       Post      Difference
    ddd      15         20          +5

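Or something along those lines, as long as it lets me include the repeated IDs. Ideally, I would also like to filter conditionally on the size of the difference, for example to subset/filter the IDs whose scores improved by more than 5 points or worsened by more than 5 points. Bear in mind that my actual dataset will contain more than the example I have just written up and provided. As always, any help would be much appreciated.

Thank you in advance :)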

Answer 1

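One option is to first convert 'Score' from factor to numeric, group by 'ID' and 'Condition', create a sequence column ('rn'), spread to 'wide' format, take the difference between the 'Post' and 'Pre' scores, and split by the sign of the 'Difference' column to create a list of tibbles.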

    library(tidyverse)
    df %>% 
       mutate(Score = as.numeric(as.character(Score))) %>%  # factor -> character -> numeric
       group_by(ID, Condition) %>% 
       mutate(rn = row_number()) %>%                        # occurrence index within ID/Condition
       spread(Condition, Score) %>% 
       mutate(Difference = Post - Pre) %>% 
       ungroup %>% 
       select(-rn) %>%
       # in dplyr >= 1.0.0 the argument is spelled .keep
       group_split(grp = sign(Difference), keep = FALSE)
    #[[1]]
    # A tibble: 8 x 4
    #  ID     Post   Pre Difference
    #  <fct> <dbl> <dbl>      <dbl>
    #1 aaa      17    23         -6
    #2 aaa      20    23         -3
    #3 bbb      17    20         -3
    #4 bbb      18    23         -5
    #5 ccc      11    21        -10
    #6 ggg      14    20         -6
    #7 hhh      15    19         -4
    #8 iii      10    18         -8

    #[[2]]
    # A tibble: 3 x 4
    #  ID     Post   Pre Difference
    #  <fct> <dbl> <dbl>      <dbl>
    #1 ccc      19    19          0
    #2 eee      22    22          0
    #3 fff      22    22          0

    #[[3]]
    # A tibble: 1 x 4
    #  ID     Post   Pre Difference
    #  <fct> <dbl> <dbl>      <dbl>
    #1 ddd      20    15          5
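The question also asks about filtering on the size of the change, which the split by sign does not cover. A minimal sketch, assuming the same occurrence-index approach as above and a threshold of 5 taken from the question's wording (the names `wide`, `big_drop`, `big_rise` and `big_change` are hypothetical, and the data frame is built directly so that Score is already numeric):

```r
library(dplyr)
library(tidyr)

ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre",
               "Post","Post","Post","Post","Post","Post","Post","Post","Post",
               "Pre","Pre","Pre","Post","Post","Post")
Score <- c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,
           23,23,21,20,18,11)
df <- data.frame(ID, Condition, Score)

wide <- df %>%
  group_by(ID, Condition) %>%
  mutate(rn = row_number()) %>%   # 1st, 2nd, ... occurrence of an ID within a condition
  ungroup() %>%
  spread(Condition, Score) %>%
  mutate(Difference = Post - Pre) %>%
  select(-rn)

# Score fell by more than 5 points
big_drop <- filter(wide, Difference < -5)

# Score rose by more than 5 points
big_rise <- filter(wide, Difference > 5)

# Changed by more than 5 points in either direction
big_change <- filter(wide, abs(Difference) > 5)
```

Whether a negative difference counts as an improvement depends on the questionnaire; the desired output in the question treats a lower Post score as improved.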

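Note: using as.data.frame(cbind(...)) is not recommended, because cbind converts its arguments to a matrix and a matrix can hold only one class, i.e. if there is one character column, all the other columns are converted to character. Wrapping the result in as.data.frame then turns those columns into factors (before R 4.0.0 the default was stringsAsFactors = TRUE).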

    df <- data.frame(...) # create the data frame directly
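The coercion described in the note is easy to see on a tiny example. This is only an illustrative sketch with three made-up rows (the names `m`, `df_direct` and `df_cbind` are not from the question):

```r
ID <- c("aaa", "bbb", "ccc")
Score <- c(23, 20, 19)

# cbind() builds a matrix, so the numbers are coerced to character
m <- cbind(ID, Score)
typeof(m)                    # "character"

# data.frame() keeps each column's own type
df_direct <- data.frame(ID, Score)
is.numeric(df_direct$Score)  # TRUE

# Going through the matrix loses the numeric type
df_cbind <- as.data.frame(cbind(ID, Score))
is.numeric(df_cbind$Score)   # FALSE
```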

Answer 2

Another tidyverse possibility is:

    df %>%
     mutate_if(is.factor, as.character) %>%
     mutate(Score = as.numeric(Score)) %>%
     group_by(Condition) %>%
     mutate(ID = make.unique(ID)) %>%
     group_by(ID) %>%
     mutate(Difference = Score - lag(Score)) %>%
     spread(Condition, Score) %>%
     summarise_all(max, na.rm = TRUE) %>%
     arrange(Difference)

       ID    Difference  Post   Pre
       <chr>      <dbl> <dbl> <dbl>
     1 ccc.1        -10    11    21
     2 iii           -8    10    18
     3 aaa           -6    17    23
     4 ggg           -6    14    20
     5 bbb.1         -5    18    23
     6 hhh           -4    15    19
     7 aaa.1         -3    20    23
     8 bbb           -3    17    20
     9 ccc            0    19    19
    10 eee            0    22    22
    11 fff            0    22    22
    12 ddd            5    20    15

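Here, it first creates unique IDs. Second, it computes the difference. Finally, it converts the data to wide format and arranges the rows by the difference.

If for some reason you need to split it by the difference, you can add the last line of @akrun's code: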
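The unique-ID step depends on make.unique being applied within each Condition, so that the second Pre row and the second Post row of an ID receive the same suffix. A small base R sketch on made-up vectors (`pre_ids` and `post_ids` are illustrative names, not part of the answer):

```r
pre_ids  <- c("aaa", "bbb", "aaa")
post_ids <- c("aaa", "bbb", "aaa")

# Applied per condition, repeated IDs get matching suffixes
make.unique(pre_ids)   # "aaa" "bbb" "aaa.1"
make.unique(post_ids)  # "aaa" "bbb" "aaa.1"

# Applied to both conditions at once, every label is unique,
# so the Pre and Post labels of a repeated ID no longer line up
all_ids <- make.unique(c(pre_ids, post_ids))
anyDuplicated(all_ids) # 0
```

This is why the pipeline calls group_by(Condition) before mutate(ID = make.unique(ID)).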

    df %>%
     mutate_if(is.factor, as.character) %>%
     mutate(Score = as.numeric(Score)) %>%
     group_by(Condition) %>%
     mutate(ID = make.unique(ID)) %>%
     group_by(ID) %>%
     mutate(Difference = Score - lag(Score)) %>%
     spread(Condition, Score) %>%
     summarise_all(max, na.rm = TRUE) %>%
     group_split(sign(Difference), keep = FALSE)

    [[1]]
    # A tibble: 8 x 4
      ID    Difference  Post   Pre
      <chr>      <dbl> <dbl> <dbl>
    1 aaa           -6    17    23
    2 aaa.1         -3    20    23
    3 bbb           -3    17    20
    4 bbb.1         -5    18    23
    5 ccc.1        -10    11    21
    6 ggg           -6    14    20
    7 hhh           -4    15    19
    8 iii           -8    10    18

    [[2]]
    # A tibble: 3 x 4
      ID    Difference  Post   Pre
      <chr>      <dbl> <dbl> <dbl>
    1 ccc            0    19    19
    2 eee            0    22    22
    3 fff            0    22    22

    [[3]]
    # A tibble: 1 x 4
      ID    Difference  Post   Pre
      <chr>      <dbl> <dbl> <dbl>
    1 ddd            5    20    15
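group_split returns an unnamed list, so the three pieces have to be identified by position. If labelled pieces are preferred, one alternative (a swapped-in technique, not what the answers above use) is base R's split on a labelled factor. A sketch on a hypothetical data frame `res` that holds a few of the differences shown above; the labels are also made up for illustration:

```r
res <- data.frame(
  ID = c("aaa", "ccc", "ddd", "eee"),
  Difference = c(-6, 0, 5, 0)
)

# sign() maps each difference to -1, 0 or 1; the factor attaches readable labels
direction <- factor(sign(res$Difference),
                    levels = c(-1, 0, 1),
                    labels = c("decreased", "unchanged", "increased"))

# split() returns a list named after the factor levels
pieces <- split(res, direction)
names(pieces)
pieces$unchanged
```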

Answer 3

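The other answers address the fact that Score ends up as a factor because of the cbind() call. Below are solutions in base R, data.table and dplyr.

All of the solutions deal with the repeated IDs by adding an extra Group variable. This is what allows spread to succeed.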
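One caveat worth checking on real data: in this answer, Group is a running index within each Condition, so the k-th Pre row is paired with the k-th Post row. That pairing is only correct when the Pre rows and the Post rows list the IDs in the same order. A small base R check, assuming the ID and Condition vectors given at the end of this answer (`pre_ids` and `post_ids` are illustrative names):

```r
ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii",
        "aaa","bbb","ccc","aaa","bbb","ccc")
Condition <- c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre",
               "Post","Post","Post","Post","Post","Post","Post","Post","Post",
               "Pre","Pre","Pre","Post","Post","Post")

# IDs in the order in which they occur within each condition
pre_ids  <- ID[Condition == "Pre"]
post_ids <- ID[Condition == "Post"]

# TRUE when the k-th Pre row and the k-th Post row belong to the same ID
identical(pre_ids, post_ids)
```

If this returns FALSE for a dataset, indexing within both ID and Condition, as in Answer 1, would probably be the safer choice.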

    # Base R ------------------------------------------------------------------

    df <- data.frame(ID, Condition, Score)
    df$Group <- ave(seq_len(nrow(df)), df$Condition, FUN = seq_along)

    df_wide <- reshape(df, timevar = 'Condition', idvar = c('ID', 'Group'), direction = 'wide')
    df_wide$Difference <- df_wide$Score.Post - df_wide$Score.Pre
    df_wide[order(df_wide$Difference),]

    # data.table --------------------------------------------------------------
    library(data.table)

    dt <- data.table(ID, Condition, Score)
    dt[, Group := seq_len(.N), by = Condition]

    dt_wide <- dcast(dt, ID + Group ~ Condition, value.var = 'Score')
    dt_wide[, Difference := Post - Pre]
    dt_wide[order(Difference),]

    # dplyr -------------------------------------------------------------------
    library(tidyverse)

    tib <- tibble(ID, Condition, Score)

    tib %>%
      group_by(Condition) %>%
      mutate(Group = row_number()) %>%
      ungroup() %>%
      spread(key = 'Condition', value = 'Score') %>%
      mutate(Difference = Post - Pre) %>%
      arrange(Difference)

For this very small dataset, base R is the fastest and data.table is the slowest.

    Unit: milliseconds
               expr    min      lq     mean  median      uq     max neval
         base_r_way 2.7562 2.98075 3.103155 3.05140 3.12810  6.0653   100
     data.table_way 6.6137 7.09705 8.216043 7.44250 8.01885 47.9138   100
          dplyr_way 4.7334 5.15005 5.350857 5.25085 5.40395  9.5594   100

And the data:

    ID <- c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","ddd","eee","fff","ggg","hhh","iii","aaa","bbb","ccc","aaa","bbb","ccc")
    Condition <- c("Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Pre","Post","Post","Post","Post","Post","Post","Post","Post","Post","Pre","Pre","Pre","Post","Post", "Post")
    Score <- as.integer(c(23,20,19,15,22,22,20,19,18,17,17,19,20,22,22,14,15,10,23,23,21,20,18,11))
