How to reshape data (with column name parsing)
I need to reshape a data.frame from this:
  TestID Machine1Measure Machine1Count Machine2Measure Machine2Count
1  10006              11            14              16            24
2  10007              23            27              32            35
to this:
  TestID Machine Measure Count
1  10006       1      11    14
2  10006       2      16    24
3  10007       1      23    27
4  10007       2      32    35
Below is the code to create each. I looked at reshape in R but couldn't figure out how to split the names.
Note: this is a subset of the columns; there are 70-140 machines. How can I make this simpler?
# b: the wide input
b <- data.frame(10006:10007, matrix(c(11, 23, 14, 27, 16, 32, 24, 35), 2, 4))
colnames(b) <- c("TestID", "Machine1Measure", "Machine1Count", "Machine2Measure", "Machine2Count")
# a: the desired long output
a <- data.frame(matrix(c(10006, 10006, 10007, 10007, 1, 2, 1, 2, 11, 16, 23, 32, 14, 24, 27, 35), 4, 4))
colnames(a) <- c("TestID", "Machine", "Measure", "Count")
b
a
The following reproduces your expected output:
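library(tidyverse)
df %>%
    gather(key, value, -TestID) %>%
    # split after the digit: "Machine1" | "Measure"
    separate(key, into = c("tmp", "what"), sep = "(?<=\\d)") %>%
    # split before the digit: "Machine" | "1"
    separate(tmp, into = c("tmp", "Machine"), sep = "(?=\\d+)") %>%
    spread(what, value) %>%
    select(-tmp)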
# TestID Machine Count Measure
#1 10006 1 14 11
#2 10006 2 24 16
#3 10007 1 27 23
#4 10007 2 35 32
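Explanation: We reshape the data from wide to long, use two separate calls to split the keys into the value names and the machine IDs, and then reshape from long back to wide. (We use a positive look-behind and a positive look-ahead to split the keys into the required fields.)
Sample data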
df <- read.table(text =
"  TestID Machine1Measure Machine1Count Machine2Measure Machine2Count
1  10006              11            14              16            24
2  10007              23            27              32            35", header = TRUE)
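As a complement to the gather/separate/spread pipeline above, here is a minimal sketch of the same reshape using tidyr's pivot_longer(), which supersedes gather() and spread() in tidyr 1.0.0 and later. It is not part of the original answer. The regular expression assumes every measurement column is named Machine<number><Name>; because the number is captured with (\\d+), multi-digit machine numbers (the question mentions 70-140 machines) stay intact, whereas the lookaround separators match at every digit boundary and may split a name such as Machine12Measure into more pieces than expected.

```r
library(tidyr)

# Same sample data as in the question
df <- data.frame(
  TestID          = 10006:10007,
  Machine1Measure = c(11, 23),
  Machine1Count   = c(14, 27),
  Machine2Measure = c(16, 32),
  Machine2Count   = c(24, 35)
)

long <- pivot_longer(
  df,
  cols            = -TestID,
  # first capture group -> Machine column, second -> output column names
  names_to        = c("Machine", ".value"),
  # assumption: columns follow the pattern Machine<number><Name>
  names_pattern   = "^Machine(\\d+)(.+)$",
  # convert the captured machine number from character to integer
  names_transform = list(Machine = as.integer)
)
long
```

The special name ".value" tells pivot_longer() to use the second captured piece ("Measure", "Count") as output column names, so no separate spread step is needed.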
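data.table can do all of this within a single melt(), which is almost 30 times faster than the (perfectly working) tidyverse solution provided by MauritsEvers.
It uses patterns() to select the columns whose names contain 'Measure' and 'Count', then melts each group into the column named in value.name.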
library( data.table )
melt( setDT( b),
id.vars = c("TestID"),
measure.vars = patterns( ".*Measure", ".*Count"),
variable.name = "Machine",
value.name = c("Measure", "Count") )
# TestID Machine Measure Count
# 1: 10006 1 11 14
# 2: 10007 1 23 27
# 3: 10006 2 16 24
# 4: 10007 2 32 35
Benchmarks
# Unit: microseconds
# expr min lq mean median uq max neval
# data.table 182.265 200.3405 245.0403 234.0825 264.6605 3137.967 1000
# reshape 1757.575 1840.7240 2180.4957 1938.3335 2011.3895 100429.392 1000
# tidyverse 6173.203 6430.7830 6925.6034 6569.9670 6763.9810 29722.714 1000
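One caveat may matter with the melt() call above when only a subset of machines is present: with patterns(), the Machine column holds the position of each matched column within its group (1, 2, ...), not the number parsed from the column name. The sketch below, which is not part of the original answer, maps the positions back to the real machine numbers. The table wide and its machine numbers 3 and 7 are hypothetical, and the mapping assumes the Measure and Count columns list the machines in the same order.

```r
library(data.table)

# Hypothetical subset in which the machine numbers are not 1, 2, ...
wide <- data.table(
  TestID          = 10006:10007,
  Machine3Measure = c(11, 23),
  Machine3Count   = c(14, 27),
  Machine7Measure = c(16, 32),
  Machine7Count   = c(24, 35)
)

long <- melt(
  wide,
  id.vars       = "TestID",
  measure.vars  = patterns("Measure$", "Count$"),
  variable.name = "Machine",
  value.name    = c("Measure", "Count")
)

# Machine is a factor of positions ("1", "2"); recover the real numbers
# by stripping the non-digits from the Measure column names.
machine_ids <- as.integer(gsub("\\D", "", grep("Measure$", names(wide), value = TRUE)))
long[, Machine := machine_ids[as.integer(Machine)]]
long
```

When the machines are numbered consecutively from 1 and the columns are in order, as in the question's sample, the positions and the machine numbers coincide and this extra step is unnecessary.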
And since nobody seems to like reshape() any more, I'll add an answer:
reshape(
  # rename "Machine1Measure" to "Measure.1" so reshape() can split on the "."
  setNames(b, sub("^.+(\\d+)(.+)$", "\\2.\\1", names(b))),
  idvar="TestID", direction="long", varying=-1, timevar="Machine"
)
# TestID Machine Measure Count
#10006.1 10006 1 11 14
#10007.1 10007 1 23 27
#10006.2 10006 2 16 24
#10007.2 10007 2 32 35
It will never compete with data.table for pure speed, but here are the timings from a brief test on 2M rows built with:
bbig <- b[rep(1:2,each=1e6),]
bbig$TestID <- make.unique(as.character(bbig$TestID))
#data.table - 0.06 secs
#reshape - 2.30 secs
#tidyverse - 56.60 secs
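One thing worth checking before applying the renaming step of the reshape() answer to all 70-140 machines: the greedy ".+" at the start of the pattern also consumes digits, so only the last digit of a multi-digit machine number reaches the capture group. The sketch below, which is not part of the original answer, compares that pattern with a variant using "\\D+", which can only consume non-digits. The column names in old are hypothetical examples.

```r
# Hypothetical column names, including multi-digit machine numbers
old <- c("TestID", "Machine1Measure", "Machine12Count", "Machine140Measure")

# Pattern from the answer: greedy ".+" leaves only the last digit for "(\\d+)"
greedy <- sub("^.+(\\d+)(.+)$", "\\2.\\1", old)
greedy
# "TestID" "Measure.1" "Count.2" "Measure.0"

# Variant: "\\D+" stops at the first digit, so the whole number is captured
fixed <- sub("^\\D+(\\d+)(.+)$", "\\2.\\1", old)
fixed
# "TestID" "Measure.1" "Count.12" "Measure.140"
```

With the two-machine sample both patterns give the same names, so the difference only shows up once machine numbers reach 10.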