Reformatting an R data.frame

Question

I have a data.frame in this format:

set.seed(1)
pl.mat <- matrix(rnorm(500*1000), nrow=500, ncol=1000)
colnames(pl.mat) <- gsub("\\s+", "", apply(expand.grid(paste("pl",1:10,sep=""),1:100), 1, function(x) paste(unlist(x),collapse=".")), perl=TRUE)
df <- cbind(data.frame(id=1:500,group.id=rep(1:25,20)),pl.mat)

> df[1:5,1:5]
  id group.id      pl1.1       pl2.1       pl3.1
1  1        1 -0.6264538  0.07730312  1.13496509
2  2        2  0.1836433 -0.29686864  1.11193185
3  3        3 -0.8356286 -1.18324224 -0.87077763
4  4        4  1.5952808  0.01129269  0.21073159
5  5        5  0.3295078  0.99160104  0.06939565

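df$id is grouped by df$group.id. Every other column name starts with a plate id (pl1-pl10), and the integer after the period is a well id (1-100). So each plate has 100 columns.

I would like to construct a new data.frame with these columns: df$id, df$group.id, the well id, and all plates.

Meaning this format: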

id   group.id  well.id         pl1         pl2         pl3
1    1         1        -0.6264538  0.07730312  1.13496509
1    1         2               ...         ...         ...
.
.
.
1    2         1               ...         ...         ...
.
.
.
500  25        100             ...         ...         ...

Is there any nice, concise code for this?

Answer 1

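library(dplyr)  # provides %>%
library(tidyr)  # gather(), separate(), spread()

df %>% 
  gather(var, val, -id, -group.id) %>%       # wide to long: one row per id and plate.well column
  separate(var, c("pl.id", "well.id")) %>%   # split "pl1.1" into "pl1" and "1"
  spread(pl.id, val)                         # back to wide: one column per plate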
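
In recent tidyr versions gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is only a sketch of the same reshape with those functions, not the code from this answer. The toy data.frame (2 plates, 2 wells, made-up values) and the names toy, long and res are assumptions for illustration.

```r
library(tidyr)

# Toy data in the question's layout: 2 plates x 2 wells (values are made up)
toy <- data.frame(
  id = 1:3,
  group.id = c(1, 2, 1),
  pl1.1 = c(0.1, 0.2, 0.3),
  pl2.1 = c(1.1, 1.2, 1.3),
  pl1.2 = c(2.1, 2.2, 2.3),
  pl2.2 = c(3.1, 3.2, 3.3)
)

# Wide to long, splitting "pl1.2" into pl.id = "pl1" and well.id = "2"
long <- pivot_longer(
  toy,
  cols = -c(id, group.id),
  names_to = c("pl.id", "well.id"),
  names_sep = "\\.",
  values_to = "val"
)

# Long to wide again, one column per plate
res <- pivot_wider(long, names_from = pl.id, values_from = val)

# well.id comes out as character; convert it so it sorts numerically
res$well.id <- as.integer(res$well.id)
```

With the real df the same calls should apply unchanged, since only the id and group.id columns are excluded from the reshape.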

Answer 2

Dan, you can create a new data.frame that contains just the columns you need. Suppose you want the columns df$id and df$group.id:

newDF <- as.data.frame(cbind(df$id, df$group.id))

Now, if you have too many columns to write out by name, you can also use indices:

newDF <- as.data.frame(cbind(df[,2], df[,5]))

Ranges work as well:

newDF <- as.data.frame(cbind(df[,2:210], df[,507:1002]))
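
One caveat: cbind() on bare vectors such as df$id drops the column names, so subsetting the data.frame directly may be more convenient. A minimal sketch on toy data follows. The toy columns and the "pl1." prefix pattern are assumptions for illustration.

```r
# Toy data in the question's layout (values are made up)
toy <- data.frame(id = 1:3, group.id = c(1, 2, 1),
                  pl1.1 = c(0.1, 0.2, 0.3), pl2.1 = c(1.1, 1.2, 1.3),
                  pl1.2 = c(2.1, 2.2, 2.3))

# Select by name: the column names are preserved
by.name <- toy[, c("id", "group.id")]

# Select every column of plate 1 by matching the "pl1." prefix
pl1.cols <- grep("^pl1\\.", names(toy), value = TRUE)
plate1 <- toy[, c("id", "group.id", pl1.cols)]
```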

Does this work for you? Another solution is to use a loop and build the indices or column names dynamically. Here is a draft:

for (i in 1:10) {
  # build the name "id<i>" and look the column up by name
  print(df[[paste0("id", i)]])
}

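Here, the column names df$id1 through df$id10 are built dynamically. They are placeholders only; the df in the question has no such columns.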
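
Applied to the plate columns, the same idea could look roughly like the sketch below: for each well, build that well's column names, pull the columns out, and stack the pieces. The toy sizes (2 plates, 3 wells, 4 ids) and the names toy, per.well and long are assumptions for illustration.

```r
# Toy data in the question's layout: 2 plates x 3 wells
set.seed(1)
n.plates <- 2
n.wells <- 3
toy <- data.frame(id = 1:4, group.id = rep(1:2, 2))
for (w in seq_len(n.wells)) {
  for (p in seq_len(n.plates)) {
    toy[[paste0("pl", p, ".", w)]] <- rnorm(nrow(toy))
  }
}

# For each well, construct its column names and extract that block
per.well <- lapply(seq_len(n.wells), function(w) {
  cols <- paste0("pl", seq_len(n.plates), ".", w)  # e.g. "pl1.2", "pl2.2"
  block <- toy[, cols, drop = FALSE]
  names(block) <- paste0("pl", seq_len(n.plates))  # drop the well suffix
  cbind(toy[, c("id", "group.id")], well.id = w, block)
})

# Stack the per-well blocks and order rows by id, then well
long <- do.call(rbind, per.well)
long <- long[order(long$id, long$well.id), ]
rownames(long) <- NULL
```

For the df in the question, n.plates would be 10 and n.wells 100.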

Regards, Torsten
