R: How to collapse a list of lists (from str_split) into a single list and preserve some row data?

The output of str_split is a list. How can a list of lists be collapsed into a flat list?

See the sample data below:

library(magrittr)   
library(dplyr)


url='https://github.com/macarthur-lab/clinvar/raw/master/output/clinvar.tsv.gz'
w=readr::read_tsv(url) #warnings can be safely ignored
w<-w %>% filter(grepl('LabCorp',all_submitters))
#traits are separated by semicolons
ttd<-stringr::str_split(w$all_traits,pattern = ';')
#there are several traits per row from str_split
ttd.l<-sapply(ttd,length)
#sample
ttd[[77]]
[1] "Hereditary cancer-predisposing syndrome"
[2] "Lynch syndrome"                         
[3] "Lynch Syndrome"                         
[4] "Neoplastic Syndromes, Hereditary"       
[5] "Hereditary non-polyposis colon cancer"  
#how to put all 'all-traits' into single vector

This does not seem to work:

traits<-lapply(ttd,c)
table(traits)

Edit: the problem with a plain unlist(ttd) is that I need to preserve the row ID from w$measureset_id, like this:

out=data.frame()
for (i in seq_along(ttd)) {
  #one data frame per row: the row's ID paired with its unique upper-cased traits
  one<-data.frame(id=w[i,'measureset_id']
                   ,trait=unique(toupper(unlist(ttd[[i]]))))
  out<-rbind(out,one)
}

head(out,5)

  measureset_id                          trait
1         36663             CARDIAC ARRHYTHMIA
2         36663                     ARRHYTHMIA
3         12779 PHEOCHROMOCYTOMA/PARAGANGLIOMA
4         12779               PHEOCHROMOCYTOMA
5         12779               PARAGANGLIOMAS 4

Your ttd is a list of character vectors. If all you want is a single character vector containing every element (length 3992), then you just need

traits <- unlist(ttd)
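
To see the difference on a small scale, here is a minimal sketch with made-up data (the list toy is hypothetical, not taken from the ClinVar file). lapply(toy, c) hands back the same list, because c() applied to a single vector returns that vector, whereas unlist() flattens everything into one character vector that table() can count.

```r
# Hypothetical toy data standing in for the output of str_split()
toy <- list(c("Lynch syndrome", "Arrhythmia"), "Arrhythmia")

# c() on a single vector returns it unchanged, so this is still a list
same <- lapply(toy, c)
identical(same, toy)

# unlist() collapses the list into one flat character vector
flat <- unlist(toy)
length(flat)

# table() now counts each trait across all rows
counts <- table(flat)
counts
```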

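Given the additional information you want, here are a couple of ways to do it. I would step into your code right before you create ttd, because that step only makes life harder for you.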

library(plyr)
library(dplyr)

#First, create a helper that returns the unique traits for one group of rows
#(tibble() replaces the deprecated data_frame())
getTraits <- function(x) tibble(trait=unique(unlist(strsplit(x$all_traits, split=";"))))

#Method 1 using plyr
traits <- ddply(w, .(measureset_id), getTraits)
head(traits)
#  measureset_id                                        trait
#1           788                 Sudden infant death syndrome
#2           788                           Brugada syndrome 2
#3           788 Primary familial hypertrophic cardiomyopathy
#4           788                 Sudden Infant Death Syndrome
#5           788                               Cardiomyopathy
#6           788                             Long QT syndrome
traits[traits$measureset_id == 36663, ]
#     measureset_id              trait
#3231         36663 Cardiac arrhythmia
#3232         36663         Arrhythmia

#Method 2 using dplyr
traitsd <- w %>% group_by(measureset_id) %>% do(getTraits(.))
head(traitsd)
#Source: local data frame [6 x 2]
#Groups: measureset_id [1]
#
#  measureset_id                                        trait
#          (int)                                        (chr)
#1           788                 Sudden infant death syndrome
#2           788                           Brugada syndrome 2
#3           788 Primary familial hypertrophic cardiomyopathy
#4           788                 Sudden Infant Death Syndrome
#5           788                               Cardiomyopathy
#6           788                             Long QT syndrome
traitsd[traitsd$measureset_id == 36663, ]
#Source: local data frame [2 x 2]
#Groups: measureset_id [1]
#
#  measureset_id              trait
#          (int)              (chr)
#1         36663 Cardiac arrhythmia
#2         36663         Arrhythmia
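
If you have already built ttd, a base R sketch along the following lines may also work without any grouping: repeat each ID once per trait with rep() and lengths(), then flatten with unlist(). The data frame w_toy and its values are made up for illustration; only the column names mirror those in the question. The final unique() drops duplicate ID/trait pairs after upper-casing, as the loop in the question does.

```r
# Hypothetical toy stand-in for w; values are invented for illustration
w_toy <- data.frame(
  measureset_id = c(36663L, 12779L),
  all_traits = c("Cardiac arrhythmia;Arrhythmia;ARRHYTHMIA",
                 "Pheochromocytoma;Paragangliomas 4"),
  stringsAsFactors = FALSE
)

# One character vector of traits per row (base R counterpart of str_split)
ttd_toy <- strsplit(w_toy$all_traits, split = ";", fixed = TRUE)

# Repeat each ID once per trait so the IDs line up with the flattened traits
out <- data.frame(
  measureset_id = rep(w_toy$measureset_id, times = lengths(ttd_toy)),
  trait = toupper(unlist(ttd_toy)),
  stringsAsFactors = FALSE
)

# Drop duplicate (id, trait) pairs, e.g. traits that differ only in case
out <- unique(out)
out
```

Because rep() and unlist() are vectorized, this avoids growing a data frame with rbind() inside a loop, which tends to get slow as the number of rows increases.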