How to reshape data (with column name parsing)

I need to reshape a data.frame from this:
  TestID Machine1Measure Machine1Count Machine2Measure Machine2Count
1  10006              11            14              16            24
2  10007              23            27              32            35

to this:

  TestID Machine Measure Count
1  10006       1      11    14
2  10006       2      16    24
3  10007       1      23    27
4  10007       2      32    35

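Code to create each one is below. I looked at reshape in R but couldn't figure out how to split the names.

Note: this is a subset of the columns; there are 70-140 machines. How can I make this simpler?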

b <-data.frame(10006:10007, matrix(c(11,23,14,27,16,32,24,35),2,4)) 
colnames(b) <- c("TestID", "Machine1Measure", "Machine1Count", "Machine2Measure", "Machine2Count") 

a<-data.frame(matrix(c(10006,10006,10007,10007,1,2,1,2,11,16,23,32,14,24,27,35),4,4)) 
colnames(a) <- c("TestID", "Machine", "Measure", "Count") 

b
a

The following reproduces your expected output:

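library(tidyr)
library(dplyr)

df %>%
    gather(key, value, -TestID) %>%
    # split after the digit: "Machine1Measure" -> "Machine1", "Measure"
    separate(key, into = c("tmp", "what"), sep = "(?<=\\d)") %>%
    # split before the digit: "Machine1" -> "Machine", "1"
    separate(tmp, into = c("tmp", "Machine"), sep = "(?=\\d+)") %>%
    spread(what, value) %>%
    select(-tmp)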
#  TestID Machine Count Measure
#1  10006       1    14      11
#2  10006       2    24      16
#3  10007       1    27      23
#4  10007       2    35      32

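Explanation: we reshape the data from wide to long, use two separate calls to split the key into the value name and the machine ID, then reshape from long back to wide. (A positive look-behind and a positive look-ahead split the key into the required fields.)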
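With 70-140 machines the numbers have more than one digit, and the zero-width splits above may then cut inside the number. A minimal sketch of an alternative, assuming tidyr >= 1.0.0 is installed (pivot_longer is not part of the original answer) and that every column follows the Machine<number><field> naming shown in the question:

```r
library(tidyr)

# Sample data from the question
df <- data.frame(
  TestID = 10006:10007,
  Machine1Measure = c(11, 23), Machine1Count = c(14, 27),
  Machine2Measure = c(16, 32), Machine2Count = c(24, 35)
)

# names_pattern captures the machine number and the field name;
# ".value" turns the second capture (Measure/Count) into output columns.
long <- pivot_longer(
  df,
  cols = -TestID,
  names_to = c("Machine", ".value"),
  names_pattern = "^Machine(\\d+)(.+)$"
)
long$Machine <- as.integer(long$Machine)  # the captured text is character
long
```

Because the whole number is captured by one group, Machine10 and Machine140 should be handled the same way as Machine1.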


Sample data

df <- read.table(text =
    "  TestID Machine1Measure Machine1Count Machine2Measure Machine2Count
1  10006              11            14              16            24
2  10007              23            27              32            35", header = T)

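data.table can do all of this in a single melt, which is almost 30 times faster than the (perfectly working) tidyverse solution provided by MauritsEvers.

It uses patterns to select the columns whose names contain 'Measure' and 'Count', then melts each group into the corresponding column named in value.name.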

library( data.table )
melt( setDT( b), 
      id.vars = c("TestID"), 
      measure.vars = patterns( ".*Measure", ".*Count"), 
      variable.name = "Machine", 
      value.name = c("Measure", "Count") )

#    TestID Machine Measure Count
# 1:  10006       1      11    14
# 2:  10007       1      23    27
# 3:  10006       2      16    24
# 4:  10007       2      32    35
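One thing to watch: the Machine column produced by melt() with patterns() is the position of each matched column within its pattern, not a number read from the column name. It matches the machine number here only because the columns run Machine1, Machine2 in order. A minimal sketch for recovering the number from the names, assuming every machine has both a Measure and a Count column in the same order (machine_ids is a hypothetical helper name, and the machine numbers 3 and 7 are made up for illustration):

```r
library(data.table)

# Question's values, but with hypothetical machine numbers 3 and 7
# to show that the melted index is positional.
b <- data.table(
  TestID = 10006:10007,
  Machine3Measure = c(11, 23), Machine3Count = c(14, 27),
  Machine7Measure = c(16, 32), Machine7Count = c(24, 35)
)

long <- melt(
  b,
  id.vars = "TestID",
  measure.vars = patterns("Measure$", "Count$"),
  variable.name = "Machine",
  value.name = c("Measure", "Count")
)

# Machine numbers parsed from the Measure columns, in column order
machine_ids <- as.integer(gsub("\\D", "", grep("Measure$", names(b), value = TRUE)))

# Replace the positional index (factor codes 1, 2) with the parsed numbers
long[, Machine := machine_ids[as.integer(Machine)]]
long
```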

Benchmarks

# Unit: microseconds
#       expr      min        lq      mean    median        uq        max neval
# data.table  182.265  200.3405  245.0403  234.0825  264.6605   3137.967  1000
# reshape    1757.575 1840.7240 2180.4957 1938.3335 2011.3895 100429.392  1000
# tidyverse  6173.203 6430.7830 6925.6034 6569.9670 6763.9810  29722.714  1000

And since nobody loves reshape() anymore, I'll add an answer:

reshape(
  setNames(b, sub("^.+(\\d+)(.+)$", "\\2.\\1", names(b))),
  idvar="TestID", direction="long", varying=-1, timevar="Machine"
)

#        TestID Machine Measure Count
#10006.1  10006       1      11    14
#10007.1  10007       1      23    27
#10006.2  10006       2      16    24
#10007.2  10007       2      32    35
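Why this works: with direction = "long", reshape() can guess the value names and the times by splitting each column name at ".", so the sub() call only has to turn Machine1Measure into Measure.1. A small check of that renaming follows, together with a variant pattern (an assumption, not from the answer) that starts with \D+ so that a multi-digit machine number is captured in one piece; with the leading .+ the greedy match would likely swallow all but the last digit.

```r
nms <- c("TestID", "Machine1Measure", "Machine1Count",
         "Machine2Measure", "Machine2Count")

# Pattern from the answer: names without a digit (TestID) are left unchanged
renamed <- sub("^.+(\\d+)(.+)$", "\\2.\\1", nms)
print(renamed)

# Assumed variant: \\D+ cannot consume digits, so the whole number is captured
renamed_multi <- sub("^\\D+(\\d+)(.+)$", "\\2.\\1",
                     c("TestID", "Machine10Measure", "Machine140Count"))
print(renamed_multi)
```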

It will never compete with data.table on pure speed, but here is a quick test on 2M rows, built as follows:

bbig <- b[rep(1:2,each=1e6),]
bbig$TestID <- make.unique(as.character(bbig$TestID))

#data.table -  0.06 secs
#reshape    -  2.30 secs
#tidyverse  - 56.60 secs