如何格式化data.frame?
How to format data.frame?
我如何根据 GeneID 重新格式化 data.frame df1。必须根据共同的 GeneID 对 table 进行分组。我也想拆分位置
df1 =
GeneID Common Organism Name Position
3 mouse 10090 Acadm Chr5:26082574-26089291(-)
3 human 9606 ACADM Chr5:15028950-15032998(-)
6 mouse 10090 Acat1 Chr5:25999022-26004798(-)
6 human 9606 ACAT1 Chr5:15471699-15477027(-)
7 human 9606 NLN Chr5:26257691-26264308(+)
8 mouse 10090 canct1 Chr5:14910122-14914899(-)
9 mouse 9606 Gm10220 Chr5:25936465-25943267(-)
9 mouse 9606 Gm10354 Chr5:25949797-25954344(-)
9 mouse 9606 Gm1979 Chr5:11594913-11599784(+)
9 human 10090 TRIL Chr7:28953358-28958413(-)
预期输出
Gene.ID M.Gene M.Chr M.start M.end H.Gene H.Chr H.start H.end
3 Acadm 5 26082574 26089291 ACADM 5 15028950 15032998
6 Acat1 5 25999022 26004798 ACAT1 5 15471699 15477027
7 NA NA NA NA NLN 5 26257691 26264308
8 canct1 5 14910122 14914899 NA NA NA NA
9 Gm10220 5 25936465 25943267 TRIL 7 28953358 28958413
9 Gm10354 5 25949797 25954344 TRIL 7 28953358 28958413
9 Gm1979 5 1159491 11599784 TRIL 7 28953358 28958413
9 Gm21149 5 11594913 11599784 TRIL 7 28953358 28958413
我们可以使用 'data.table' 的 devel
版本,即。 v1.9.5
。安装说明是 here
.
我们将 'data.frame' 更改为 'data.table' (setDT(df1)
)。使用 tstrsplit
,我们将 'Position' 除以所有非数字字符 ([^0-9]+
) 以创建新列 ('Chr'、'start'、'end').
library(data.table)#v1.9.5+
DT <- setDT(df1)[, c('Chr', 'start', 'end') :=tstrsplit(Position, '[^0-9]+')[-1]]
创建按 'GeneID' 和 'Common'
分组的序列列 ('ind')
DT[, ind:=1:.N, .(GeneID, Common)]
dcast
从 devel 版本可以取多个 value.var
列并将 'long' 格式更改为 'wide' 格式。我们可以用数据集中的非 NA 值替换 NA 值。
dcast(DT, GeneID+ind~substr(Common, 1, 1), value.var=names(DT)[c(4,6:8)])[,
lapply(.SD, function(x) x[!is.na(x)]) , GeneID, .SDcols=h_Name:m_end]
# GeneID h_Name m_Name h_Chr m_Chr h_start m_start h_end m_end
#1: 3 ACADM Acadm 5 5 15028950 26082574 15032998 26089291
#2: 6 ACAT1 Acat1 5 5 15471699 25999022 15477027 26004798
#3: 7 NLN NA 5 NA 26257691 NA 26264308 NA
#4: 8 NA canct1 NA 5 NA 14910122 NA 14914899
#5: 9 TRIL Gm10220 7 5 28953358 25936465 28958413 25943267
#6: 9 TRIL Gm10354 7 5 28953358 25949797 28958413 25954344
#7: 9 TRIL Gm1979 7 5 28953358 11594913 28958413 11599784
另一个选项使用 lapply
# using split method from akrun's answer
library(data.table)#v1.9.5+
DT <- setDT(df1)[, c('Chr', 'start', 'end') :=tstrsplit(Position, '[^0-9]+')[-1]]
out = setDF(Reduce(function(...) merge(..., by="GeneID", all = T),
lapply(split(DT, DT$Common),
function(x) subset(x, select = -c(Common, Position, Organism)))))
colnames(out) = gsub("x", "H", colnames(out))
colnames(out) = gsub("y", "M", colnames(out))
#> out
# GeneID Name.H Chr.H start.H end.H Name.M Chr.M start.M end.M
#1 3 ACADM 5 15028950 15032998 Acadm 5 26082574 26089291
#2 6 ACAT1 5 15471699 15477027 Acat1 5 25999022 26004798
#3 7 NLN 5 26257691 26264308 <NA> <NA> <NA> <NA>
#4 8 <NA> <NA> <NA> <NA> canct1 5 14910122 14914899
#5 9 TRIL 7 28953358 28958413 Gm10220 5 25936465 25943267
#6 9 TRIL 7 28953358 28958413 Gm10354 5 25949797 25954344
#7 9 TRIL 7 28953358 28958413 Gm1979 5 11594913 11599784
我如何根据 GeneID 重新格式化 data.frame df1。必须根据共同的 GeneID 对 table 进行分组。我也想拆分位置
df1 =
GeneID Common Organism Name Position
3 mouse 10090 Acadm Chr5:26082574-26089291(-)
3 human 9606 ACADM Chr5:15028950-15032998(-)
6 mouse 10090 Acat1 Chr5:25999022-26004798(-)
6 human 9606 ACAT1 Chr5:15471699-15477027(-)
7 human 9606 NLN Chr5:26257691-26264308(+)
8 mouse 10090 canct1 Chr5:14910122-14914899(-)
9 mouse 9606 Gm10220 Chr5:25936465-25943267(-)
9 mouse 9606 Gm10354 Chr5:25949797-25954344(-)
9 mouse 9606 Gm1979 Chr5:11594913-11599784(+)
9 human 10090 TRIL Chr7:28953358-28958413(-)
预期输出
Gene.ID M.Gene M.Chr M.start M.end H.Gene H.Chr H.start H.end
3 Acadm 5 26082574 26089291 ACADM 5 15028950 15032998
6 Acat1 5 25999022 26004798 ACAT1 5 15471699 15477027
7 NA NA NA NA NLN 5 26257691 26264308
8 canct1 5 14910122 14914899 NA NA NA NA
9 Gm10220 5 25936465 25943267 TRIL 7 28953358 28958413
9 Gm10354 5 25949797 25954344 TRIL 7 28953358 28958413
9 Gm1979 5 1159491 11599784 TRIL 7 28953358 28958413
9 Gm21149 5 11594913 11599784 TRIL 7 28953358 28958413
我们可以使用 'data.table' 的 devel
版本,即。 v1.9.5
。安装说明是 here
.
我们将 'data.frame' 更改为 'data.table' (setDT(df1)
)。使用 tstrsplit
,我们将 'Position' 除以所有非数字字符 ([^0-9]+
) 以创建新列 ('Chr'、'start'、'end').
library(data.table)#v1.9.5+
DT <- setDT(df1)[, c('Chr', 'start', 'end') :=tstrsplit(Position, '[^0-9]+')[-1]]
创建按 'GeneID' 和 'Common'
分组的序列列 ('ind')DT[, ind:=1:.N, .(GeneID, Common)]
dcast
从 devel 版本可以取多个 value.var
列并将 'long' 格式更改为 'wide' 格式。我们可以用数据集中的非 NA 值替换 NA 值。
dcast(DT, GeneID+ind~substr(Common, 1, 1), value.var=names(DT)[c(4,6:8)])[,
lapply(.SD, function(x) x[!is.na(x)]) , GeneID, .SDcols=h_Name:m_end]
# GeneID h_Name m_Name h_Chr m_Chr h_start m_start h_end m_end
#1: 3 ACADM Acadm 5 5 15028950 26082574 15032998 26089291
#2: 6 ACAT1 Acat1 5 5 15471699 25999022 15477027 26004798
#3: 7 NLN NA 5 NA 26257691 NA 26264308 NA
#4: 8 NA canct1 NA 5 NA 14910122 NA 14914899
#5: 9 TRIL Gm10220 7 5 28953358 25936465 28958413 25943267
#6: 9 TRIL Gm10354 7 5 28953358 25949797 28958413 25954344
#7: 9 TRIL Gm1979 7 5 28953358 11594913 28958413 11599784
另一个选项使用 lapply
# using split method from akrun's answer
library(data.table)#v1.9.5+
DT <- setDT(df1)[, c('Chr', 'start', 'end') :=tstrsplit(Position, '[^0-9]+')[-1]]
out = setDF(Reduce(function(...) merge(..., by="GeneID", all = T),
lapply(split(DT, DT$Common),
function(x) subset(x, select = -c(Common, Position, Organism)))))
colnames(out) = gsub("x", "H", colnames(out))
colnames(out) = gsub("y", "M", colnames(out))
#> out
# GeneID Name.H Chr.H start.H end.H Name.M Chr.M start.M end.M
#1 3 ACADM 5 15028950 15032998 Acadm 5 26082574 26089291
#2 6 ACAT1 5 15471699 15477027 Acat1 5 25999022 26004798
#3 7 NLN 5 26257691 26264308 <NA> <NA> <NA> <NA>
#4 8 <NA> <NA> <NA> <NA> canct1 5 14910122 14914899
#5 9 TRIL 7 28953358 28958413 Gm10220 5 25936465 25943267
#6 9 TRIL 7 28953358 28958413 Gm10354 5 25949797 25954344
#7 9 TRIL 7 28953358 28958413 Gm1979 5 11594913 11599784