R: How to collapse a list of lists (from str_split) into a single list while preserving some row data?
The output of str_split is a list. How can a list of lists be collapsed into a flat list?
See the sample data below:
library(magrittr)
library(dplyr)
url='https://github.com/macarthur-lab/clinvar/raw/master/output/clinvar.tsv.gz'
w=readr::read_tsv(url) #warnings can be safely ignored
w<-w %>% filter(grepl('LabCorp',all_submitters))
#traits are separated by semicolons
ttd<-stringr::str_split(w$all_traits,pattern = ';')
#there are several traits per row from str_split
ttd.l<-sapply(ttd,length)
#sample
ttd[[77]]
[1] "Hereditary cancer-predisposing syndrome"
[2] "Lynch syndrome"
[3] "Lynch Syndrome"
[4] "Neoplastic Syndromes, Hereditary"
[5] "Hereditary non-polyposis colon cancer"
#how to put all 'all-traits' into single vector
This does not seem to work:
traits<-lapply(ttd,c)
table(traits)
Edit: the problem with a plain unlist(ttd) is that I need to keep the row ID from w$measureset_id, like this:
out=data.frame()
for (i in 1:length(ttd)) {
print(i)
#unlist(ttd[[i]])
one<-data.frame(id=w[i,'measureset_id']
,trait=unique(toupper(unlist(ttd[[i]]))))
out<-rbind(out,one)
}
head(out,5)
measureset_id trait
1 36663 CARDIAC ARRHYTHMIA
2 36663 ARRHYTHMIA
3 12779 PHEOCHROMOCYTOMA/PARAGANGLIOMA
4 12779 PHEOCHROMOCYTOMA
5 12779 PARAGANGLIOMAS 4
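Your ttd is a list of character vectors. If what you want is a single character vector of length 3992 containing all of the elements, then all you need is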
traits <- unlist(ttd)
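To make the flattening step concrete, here is a minimal sketch on toy data. The object names `ttd_toy` and `traits_toy` are hypothetical stand-ins, not part of the original data; base `strsplit` is used in place of `stringr::str_split` so the snippet needs no packages.

```r
# Toy stand-in for the output of str_split(): a list of character vectors
ttd_toy <- strsplit(c("Cardiac arrhythmia;Arrhythmia",
                      "Lynch syndrome;Lynch Syndrome;Cardiomyopathy"),
                    split = ";", fixed = TRUE)

# unlist() concatenates every element of the list into one character vector
traits_toy <- unlist(ttd_toy)

length(traits_toy)  # 5
lengths(ttd_toy)    # 2 3  (number of traits contributed by each row)
```

Note that the flat vector no longer records which row each trait came from; `lengths()` is the piece of information that lets you rebuild that link.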
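Given the additional information you want, here are a couple of ways to do it. I would step into your code right before you create ttd, since that step only makes life harder for you.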
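library(plyr)   # load plyr before dplyr so that dplyr's functions are not masked
library(dplyr)
#First, create a useful function: split one group's all_traits on ";" and
#return the unique traits as a one-column data frame
#(tibble() replaces data_frame(), which is deprecated in current versions)
getTraits <- function(x) tibble(trait=unique(unlist(strsplit(x$all_traits, split=";"))))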
#Method 1 using plyr
traits <- ddply(w, .(measureset_id), getTraits)
head(traits)
# measureset_id trait
#1 788 Sudden infant death syndrome
#2 788 Brugada syndrome 2
#3 788 Primary familial hypertrophic cardiomyopathy
#4 788 Sudden Infant Death Syndrome
#5 788 Cardiomyopathy
#6 788 Long QT syndrome
traits[traits$measureset_id == 36663, ]
# measureset_id trait
#3231 36663 Cardiac arrhythmia
#3232 36663 Arrhythmia
#Method 2 using dplyr
traitsd <- w %>% group_by(measureset_id) %>% do(getTraits(.))
head(traitsd)
#Source: local data frame [6 x 2]
#Groups: measureset_id [1]
#
# measureset_id trait
# (int) (chr)
#1 788 Sudden infant death syndrome
#2 788 Brugada syndrome 2
#3 788 Primary familial hypertrophic cardiomyopathy
#4 788 Sudden Infant Death Syndrome
#5 788 Cardiomyopathy
#6 788 Long QT syndrome
traitsd[traitsd$measureset_id == 36663, ]
#Source: local data frame [2 x 2]
#Groups: measureset_id [1]
#
# measureset_id trait
# (int) (chr)
#1 36663 Cardiac arrhythmia
#2 36663 Arrhythmia
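As a package-free alternative to the two methods above, the same long table can be sketched in base R by repeating each ID once per trait. This is an assumption-laden illustration, not the answer's own method: `w_toy`, `parts` and `long` are hypothetical names, and the toy rows only imitate the shape of the real data.

```r
# Toy stand-in for w: one ID column and one semicolon-separated trait column
w_toy <- data.frame(
  measureset_id = c(36663L, 12779L),
  all_traits = c("Cardiac arrhythmia;Arrhythmia",
                 "Pheochromocytoma/Paraganglioma;Pheochromocytoma;PHEOCHROMOCYTOMA"),
  stringsAsFactors = FALSE
)

# Split each row's traits; the result is a list with one vector per row
parts <- strsplit(w_toy$all_traits, split = ";", fixed = TRUE)

# Repeat every ID as many times as its row has traits, then flatten the list
long <- data.frame(
  measureset_id = rep(w_toy$measureset_id, times = lengths(parts)),
  trait = toupper(unlist(parts)),
  stringsAsFactors = FALSE
)

# Drop rows that became identical after upper-casing
long <- unique(long)
long
```

Because `rep()` and `unlist()` are vectorised, this avoids growing a data frame with `rbind()` inside a loop, which tends to get slow as the number of rows increases.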