Reshaping then collapsing dataframes
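I am struggling to convert my messy, unequal-length data.frame from a wide table to a long table, and then collapse (summarise) it by a new variable. Currently it looks like this, with Gene as one variable and GO_terms as a variable containing multiple comma-separated values: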
Gene GO_terms
AA1006G00001 GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021
AA100G00001 GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310
AA100G00002 GO:0098655, GO:0008643, GO:0005886
The first step I want to take is to convert it to "long" format, so that it looks like this:
Gene GO_terms
AA1006G00001 GO:0098655
AA1006G00001 GO:0008643
AA1006G00001 GO:0005351
AA1006G00001 GO:0005886
AA1006G00001 GO:0016021
AA100G00001 GO:0001666
AA100G00001 GO:0009944
AA100G00001 GO:0009862
AA100G00001 GO:0010075
AA100G00001 GO:0010014
AA100G00001 GO:0009855
AA100G00001 GO:0010310
AA100G00002 GO:0008270
AA100G00002 GO:0005634
AA100G00002 GO:0005886
AA100G00003 GO:0005488
AA100G00003 GO:0005634
Then I want to restructure this table by swapping the two variables and collapsing it as follows:
GO_terms Genes
GO:0005351 AA1006G00001
GO:0005886 AA1006G00001, AA100G00002
GO:0008643 AA1006G00001, AA100G00002
GO:0009855 AA100G00001
GO:0009862 AA100G00001
GO:0009944 AA100G00001
GO:0010014 AA100G00001
GO:0010075 AA100G00001
GO:0010310 AA100G00001
GO:0016021 AA1006G00001
GO:0098655 AA1006G00001, AA100G00001, AA100G00002
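The variable containing the genes can be in a single column (comma-separated values) or spread over multiple columns.
Can anyone provide a tidyr, reshape2, or dplyr solution?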
EDIT: the dput() of the table is:
structure(list(`Gene ` = c("AA1006G00001\t", "AA100G00001\t",
"AA100G00002\t"), `GO_terms ` = c("GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021\t\t",
"GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310",
"GO:0098655, GO:0008643, GO:0005886")), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(
cols = list(`Gene ` = structure(list(), class = c("collector_character",
"collector")), `GO_terms ` = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector"))), class = "col_spec"))
It seems you are doing some GO (Gene Ontology) analysis. You could try inverseList from topGO (one of the most popular R packages for GO analysis in Bioconductor):
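library(topGO)
# The dput() column names carry trailing whitespace; trim them so df$Gene and df$GO_terms resolve
names(df) <- trimws(names(df))
# Build a named list: gene -> vector of GO terms (stray tabs removed)
gene.to.go <- strsplit(gsub('\t', '', df$GO_terms), ', ', fixed = TRUE)
names(gene.to.go) <- gsub('\t', '', df$Gene)
# Invert the mapping: GO term -> vector of genes
go.to.gene <- inverseList(gene.to.go)
data.frame(GO_term = names(go.to.gene), Genes = sapply(go.to.gene, paste0, collapse = ', '),
           stringsAsFactors = FALSE, row.names = NULL)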
# GO_term Genes
# 1 GO:0005351 AA1006G00001
# 2 GO:0005886 AA1006G00001, AA100G00002
# 3 GO:0008643 AA1006G00001, AA100G00002
# 4 GO:0009855 AA100G00001
# 5 GO:0009862 AA100G00001
# 6 GO:0009944 AA100G00001
# 7 GO:0010014 AA100G00001
# 8 GO:0010075 AA100G00001
# 9 GO:0010310 AA100G00001
# 10 GO:0016021 AA1006G00001
# 11 GO:0098655 AA1006G00001, AA100G00001, AA100G00002
In fact, it is more convenient to work with the data if you import the GO mapping file with readMappings from topGO.
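If installing Bioconductor is not an option, the same inversion can be sketched in base R. This is only an illustration of the idea, not how topGO implements inverseList. It assumes the sample data from the question, retyped here with clean column names, and the object names gene_to_go, go_to_gene and result are made up for the example.

```r
# Assumption: the question's sample data, retyped with clean column names.
df <- data.frame(
  Gene = c("AA1006G00001\t", "AA100G00001\t", "AA100G00002\t"),
  GO_terms = c(
    "GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021\t\t",
    "GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310",
    "GO:0098655, GO:0008643, GO:0005886"
  ),
  stringsAsFactors = FALSE
)

# Named list: gene -> character vector of GO terms (surrounding whitespace trimmed)
gene_to_go <- strsplit(trimws(df$GO_terms), ",\\s*")
names(gene_to_go) <- trimws(df$Gene)

# Repeat each gene once per GO term, then group the genes by term
go_to_gene <- split(
  rep(names(gene_to_go), lengths(gene_to_go)),
  unlist(gene_to_go, use.names = FALSE)
)

# Collapse each group of genes into one comma-separated string
result <- data.frame(
  GO_terms = names(go_to_gene),
  Genes = vapply(go_to_gene, paste, character(1), collapse = ", "),
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Here split() does the actual inversion: it groups the repeated gene names by the flattened vector of GO terms, giving one list element per term.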
Here is a tidyr and dplyr solution:
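library(tidyr)
library(dplyr)
# Strip the stray whitespace/tabs seen in the dput() from column names and values
names(df) <- trimws(names(df))
df[] <- lapply(df, trimws)
# Allow up to 100 GO terms per gene; raise the upper bound of 1:100 if any gene has more
long <- df %>%
  separate(GO_terms, into = paste0("a", 1:100), sep = ", ", extra = "merge", fill = "right") %>%
  gather(key = "key", value = "GO_terms", -Gene)
# Drop the NA padding rows and keep only the desired columns
long <- long[!is.na(long$GO_terms), c("Gene", "GO_terms")]
# Collapse to one row per GO term with a comma-separated list of genes
final <- long %>% group_by(GO_terms) %>% summarize(Gene = toString(Gene))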
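As a possible alternative to padding out 100 placeholder columns, tidyr's separate_rows() can split the comma-separated values straight into long format. The following is a minimal sketch, assuming the question's sample data retyped with clean column names. The names long and final mirror the answer above, and Genes is the output column name used in the question's desired result.

```r
library(dplyr)
library(tidyr)

# Assumption: the question's sample data, retyped with clean column names.
df <- data.frame(
  Gene = c("AA1006G00001\t", "AA100G00001\t", "AA100G00002\t"),
  GO_terms = c(
    "GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021\t\t",
    "GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310",
    "GO:0098655, GO:0008643, GO:0005886"
  ),
  stringsAsFactors = FALSE
)

# Step 1: trim stray tabs, then put each GO term on its own row (long format)
long <- df %>%
  mutate(Gene = trimws(Gene), GO_terms = trimws(GO_terms)) %>%
  separate_rows(GO_terms, sep = ",\\s*")

# Step 2: swap the roles of the columns, one row per GO term with its genes collapsed
final <- long %>%
  group_by(GO_terms) %>%
  summarise(Genes = toString(Gene))
```

Because separate_rows() creates exactly as many rows as there are terms, there is no upper bound to guess and no NA padding to filter out afterwards.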