Reshaping from wide to long data while collapsing variable values for same IDs in R

I am fairly new to R. I have a data frame df that looks like this, where PMID is an ID:

PMID          Variable         Value
1             MH               Humans
1             MH               Male
1             MH               Middle Aged
1             RN               Aldosterone
1             RN               Renin
2             MH               Accidents, Traffic
2             MH               Male
2             RN               Antivenins
3             MH               Humans
3             MH               Crotulus
3             MH               Young Adult

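And so on. As you can see, some IDs have multiple MH and/or RN entries, and some have none or just one. I want to collapse all entries of each variable for each PMID. I would also like the entries to be separated by commas after collapsing, but first the spaces in the data frame above should be replaced with _ so that each value stays intact. My final data frame should look like this: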

PMID         MH                                 RN
1            Humans, Male, Middle_Aged          Aldosterone, Renin
2            Accidents,_Traffic, Male           Antivenins
3            Humans, Crotulus, Young_Adult

I have more than 5 million rows, so please help make the code computationally efficient. Thanks for your help.

Here is a solution using dplyr and tidyr:

library(dplyr)
library(tidyr)

d <- read.table(
text='PMID;Variable;Value
1;MH;Humans
1;MH;Male
1;MH;Middle Aged
1;RN;Aldosterone
1;RN;Renin
2;MH;Accidents, Traffic
2;MH;Male
2;RN;Antivenins
3;MH;Humans
3;MH;Crotulus
3;MH;Young Adult', 
header=TRUE, sep=';', stringsAsFactors=FALSE)

d %>% 
  group_by(PMID, Variable) %>% 
  summarise(Value=paste(gsub(' ', '_', Value), collapse=', ')) %>% 
  spread(Variable, Value)


## Source: local data frame [3 x 3]
## Groups: PMID [3]
## 
## # A tibble: 3 x 3
##    PMID                            MH                  RN
## * <int>                         <chr>               <chr>
## 1     1     Humans, Male, Middle_Aged  Aldosterone, Renin
## 2     2      Accidents,_Traffic, Male          Antivenins
## 3     3 Humans, Crotulus, Young_Adult                <NA>
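In more recent tidyr releases, spread() is superseded by pivot_wider(). A minimal sketch of the same pipeline in that style follows, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument). The toy data frame d and the result name wide are only illustrative.

```r
library(dplyr)
library(tidyr)

# Toy data mirroring the question (assumption: same three columns)
d <- data.frame(
  PMID     = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  Variable = c("MH", "MH", "MH", "RN", "RN", "MH", "MH", "RN", "MH", "MH", "MH"),
  Value    = c("Humans", "Male", "Middle Aged", "Aldosterone", "Renin",
               "Accidents, Traffic", "Male", "Antivenins",
               "Humans", "Crotulus", "Young Adult"),
  stringsAsFactors = FALSE
)

wide <- d %>%
  group_by(PMID, Variable) %>%
  # fixed = TRUE treats " " as a literal string rather than a regex
  summarise(Value = paste(gsub(" ", "_", Value, fixed = TRUE), collapse = ", "),
            .groups = "drop") %>%
  # one column per Variable; PMIDs without a given Variable get NA
  pivot_wider(names_from = Variable, values_from = Value)

print(wide)
```

Dropping the grouping before reshaping keeps the result an ungrouped tibble, which is usually what later steps expect.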

Here is my solution:

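library(plyr)      # provides ddply(), .() and summarise()
library(reshape2)  # provides acast()
# Collapse all values of each PMID/Variable pair into one comma-separated string
df=ddply(df,.(PMID,Variable), summarise,Pri = paste(Value,collapse=","))
# Cast to a wide character matrix: one row per PMID, one column per Variable
acast(df, PMID ~ Variable, value.var = "Pri")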

  MH                           RN                 
1 "Humans,Male,MiddleAged"     "Aldosterone,Renin"
2 "Accidentstraffic,Male"      "Antivenins"       
3 "Humans,Crotulus,YoungAdult" NA  
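For comparison, a similar character matrix can be built with base R alone. This is only a sketch, not the method used above: it relies on tapply() with two grouping vectors, and it adds the space-to-underscore replacement the question asks for. The toy data frame and the name wide are illustrative.

```r
# Toy data mirroring the question (assumption: same three columns)
df <- data.frame(
  PMID     = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  Variable = c("MH", "MH", "MH", "RN", "RN", "MH", "MH", "RN", "MH", "MH", "MH"),
  Value    = c("Humans", "Male", "Middle Aged", "Aldosterone", "Renin",
               "Accidents, Traffic", "Male", "Antivenins",
               "Humans", "Crotulus", "Young Adult"),
  stringsAsFactors = FALSE
)

# Replace spaces first, then paste the values of every PMID/Variable cell;
# cells with no observations are left as NA
wide <- tapply(
  gsub(" ", "_", df$Value, fixed = TRUE),
  list(PMID = df$PMID, Variable = df$Variable),
  paste, collapse = ", "
)

print(wide)
```

The result is a matrix with PMID as row names, so it would need as.data.frame() plus a PMID column if a data frame is required. No claim is made here about how it performs on 5 million rows.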

Since the production data set has more than 5 million rows and the OP asked for an efficient solution, I suggest using data.table:

library(data.table)   # CRAN version 1.10.4 used
setDT(df)[, Value := stringr::str_replace_all(Value, " ", "_")][]
dcast(df, PMID ~ Variable, toString, value.var = "Value")
   PMID                            MH                 RN
1:    1     Humans, Male, Middle_Aged Aldosterone, Renin
2:    2      Accidents,_Traffic, Male         Antivenins
3:    3 Humans, Crotulus, Young_Adult
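Note that setDT() together with := modifies df by reference. If the original data frame should stay untouched, a variant along the following lines may help. It is a sketch under assumptions: it works on a copy, uses base gsub() instead of stringr, and aggregates before casting, so missing PMID/Variable combinations come out as NA rather than an empty string. The names dt, collapsed and wide are illustrative.

```r
library(data.table)

# Toy data mirroring the question (assumption: same three columns)
df <- data.frame(
  PMID     = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  Variable = c("MH", "MH", "MH", "RN", "RN", "MH", "MH", "RN", "MH", "MH", "MH"),
  Value    = c("Humans", "Male", "Middle Aged", "Aldosterone", "Renin",
               "Accidents, Traffic", "Male", "Antivenins",
               "Humans", "Crotulus", "Young Adult"),
  stringsAsFactors = FALSE
)

dt <- as.data.table(df)  # makes a copy, so df itself is not modified

# Collapse the values of each PMID/Variable pair into one string
collapsed <- dt[
  , .(Value = paste(gsub(" ", "_", Value, fixed = TRUE), collapse = ", ")),
  by = .(PMID, Variable)
]

# One row per PMID, one column per Variable
wide <- dcast(collapsed, PMID ~ Variable, value.var = "Value")

print(wide)
```

On a very large table the extra copy costs memory, so the in-place version above may be preferable when df is not needed afterwards.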

Data

df <- readr::read_table(
 "PMID          Variable         Value
  1             MH               Humans
  1             MH               Male
  1             MH               Middle Aged
  1             RN               Aldosterone
  1             RN               Renin
  2             MH               Accidents, Traffic
  2             MH               Male
  2             RN               Antivenins
  3             MH               Humans
  3             MH               Crotulus
  3             MH               Young Adult"
)