在 R 中重塑铸造和熔化
Casting and Melting with reshape in R
举个例子,假设我有以下数据框:
datas=data.frame(Variables=c("Power","Happiness","Power","Happiness"),
Country=c("France", "France", "UK", "UK"), y2000=c(1213,1872,1726,2234), y2001=c(1234,2345,6433,9082))
导致以下输出:
Variables Country 2000 2001
1 Power France 1213 1234
2 Happiness France 1872 2345
3 Power UK 1726 6433
4 Happiness UK 2234 9082
我想按如下方式重塑此数据框:
Year Country Power Happiness
1 2000 France 1213 1872
2 2001 France 1234 2345
3 2000 UK 1726 2234
4 2001 UK 6433 9082
我开始时:
q2=cast(datas, Country~Variables, value="2000")
但随后出现如下错误:
Aggregation requires fun.aggregate: length used as default
Error in `[.data.frame`(sort_df(data, variables), , c(variables, "value"), :
undefined columns selected
有什么建议吗?
另外:我的数据框真的很大(417120 x 62)对解决方案有影响吗?
也许您对 tidyverse
替代方案感兴趣
library(tidyverse)
df %>%
gather(Year, val, -Variables, -Country) %>%
spread(Variables, val)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
或使用reshape2::melt
和reshape2::dcast
reshape2::dcast(
reshape2::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
或(相同地)使用 data.table::melt
和 data.table::dcast
data.table::dcast(
data.table::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
就 performance/runtime 而言,我认为 data.table
或 tidyr
解决方案是最有效的。您可以通过 运行 a microbenchmark
检查一些更大的样本数据。
示例数据
df <-read.table(text =
" Variables Country 2000 2001
1 Power France 1213 1234
2 Happiness France 1872 2345
3 Power UK 1726 6433
4 Happiness UK 2234 9082", header = T)
colnames(df)[3:4] <- c("2000", "2001")
基准分析
以下结果来自 microbenchmark
四种方法的分析,基于(稍微)更大的 78x22
样本数据集。
set.seed(2017)
df <- data.frame(
Variables = rep(c("Power", "Happiness", "something_else"), 26),
Country = rep(LETTERS[1:26], each = 3),
matrix(sample(10000, 20 * 26 * 3), nrow = 26 * 3))
colnames(df)[3:ncol(df)] <- 2000:2019
library(microbenchmark)
library(tidyr)
res <- microbenchmark(
reshape2 = {
reshape2::dcast(
reshape2::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
},
tidyr = {
df %>%
gather(Year, val, -Variables, -Country) %>%
spread(Variables, val)
},
datatable = {
data.table::dcast(
data.table::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
},
reshape = {
reshape::cast(reshape::melt(df), Country + variable ~ Variables)
}
)
res
#Unit: milliseconds
# expr min lq mean median uq max neval
# reshape2 3.088740 3.449686 4.313044 3.919372 5.112560 7.856902 100
# tidyr 4.482361 4.982017 6.215872 5.771133 6.931964 28.293377 100
# datatable 3.179035 3.511542 4.861192 4.040188 5.123103 46.010810 100
# reshape 27.371094 30.226222 32.425667 32.504644 34.118499 41.286803 100
library(ggplot2)
autoplot(res)
如上所述,我强烈建议使用 tidyr
而不是 reshape
,或者至少使用 reshape2
而不是 reshape
,因为它修复了许多性能问题reshape
.
的问题
在reshape
本身,你必须先融化datas
> cast(melt(datas), Country + variable ~ Variables)
Using Variables, Country as id variables
Country variable Happiness Power
1 France y2000 1872 1213
2 France y2001 2345 1234
3 UK y2000 2234 1726
4 UK y2001 9082 6433
然后根据需要重命名和转换列。
在 reshape2
中,代码是相同的,但您将使用 dcast
而不是 cast
。 tidyr
,正如@Maurits Evers 上面的解决方案是一个更好的解决方案,大多数开发已经从 reshape2
转移到 tidyverse
举个例子,假设我有以下数据框:
datas=data.frame(Variables=c("Power","Happiness","Power","Happiness"),
Country=c("France", "France", "UK", "UK"), y2000=c(1213,1872,1726,2234), y2001=c(1234,2345,6433,9082))
导致以下输出:
Variables Country 2000 2001
1 Power France 1213 1234
2 Happiness France 1872 2345
3 Power UK 1726 6433
4 Happiness UK 2234 9082
我想按如下方式重塑此数据框:
Year Country Power Happiness
1 2000 France 1213 1872
2 2001 France 1234 2345
3 2000 UK 1726 2234
4 2001 UK 6433 9082
我开始时:
q2=cast(datas, Country~Variables, value="2000")
但随后出现如下错误:
Aggregation requires fun.aggregate: length used as default
Error in `[.data.frame`(sort_df(data, variables), , c(variables, "value"), :
undefined columns selected
有什么建议吗? 另外:我的数据框真的很大(417120 x 62)对解决方案有影响吗?
也许您对 tidyverse
替代方案感兴趣
library(tidyverse)
df %>%
gather(Year, val, -Variables, -Country) %>%
spread(Variables, val)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
或使用reshape2::melt
和reshape2::dcast
reshape2::dcast(
reshape2::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
或(相同地)使用 data.table::melt
和 data.table::dcast
data.table::dcast(
data.table::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
# Country Year Happiness Power
#1 France 2000 1872 1213
#2 France 2001 2345 1234
#3 UK 2000 2234 1726
#4 UK 2001 9082 6433
就 performance/runtime 而言,我认为 data.table
或 tidyr
解决方案是最有效的。您可以通过 运行 a microbenchmark
检查一些更大的样本数据。
示例数据
df <-read.table(text =
" Variables Country 2000 2001
1 Power France 1213 1234
2 Happiness France 1872 2345
3 Power UK 1726 6433
4 Happiness UK 2234 9082", header = T)
colnames(df)[3:4] <- c("2000", "2001")
基准分析
以下结果来自 microbenchmark
四种方法的分析,基于(稍微)更大的 78x22
样本数据集。
set.seed(2017)
df <- data.frame(
Variables = rep(c("Power", "Happiness", "something_else"), 26),
Country = rep(LETTERS[1:26], each = 3),
matrix(sample(10000, 20 * 26 * 3), nrow = 26 * 3))
colnames(df)[3:ncol(df)] <- 2000:2019
library(microbenchmark)
library(tidyr)
res <- microbenchmark(
reshape2 = {
reshape2::dcast(
reshape2::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
},
tidyr = {
df %>%
gather(Year, val, -Variables, -Country) %>%
spread(Variables, val)
},
datatable = {
data.table::dcast(
data.table::melt(df, id.vars = c("Country", "Variables"), variable.name = "Year"),
Country + Year ~ Variables)
},
reshape = {
reshape::cast(reshape::melt(df), Country + variable ~ Variables)
}
)
res
#Unit: milliseconds
# expr min lq mean median uq max neval
# reshape2 3.088740 3.449686 4.313044 3.919372 5.112560 7.856902 100
# tidyr 4.482361 4.982017 6.215872 5.771133 6.931964 28.293377 100
# datatable 3.179035 3.511542 4.861192 4.040188 5.123103 46.010810 100
# reshape 27.371094 30.226222 32.425667 32.504644 34.118499 41.286803 100
library(ggplot2)
autoplot(res)
如上所述,我强烈建议使用 tidyr
而不是 reshape
,或者至少使用 reshape2
而不是 reshape
,因为它修复了许多性能问题reshape
.
在reshape
本身,你必须先融化datas
> cast(melt(datas), Country + variable ~ Variables)
Using Variables, Country as id variables
Country variable Happiness Power
1 France y2000 1872 1213
2 France y2001 2345 1234
3 UK y2000 2234 1726
4 UK y2001 9082 6433
然后根据需要重命名和转换列。
在 reshape2
中,代码是相同的,但您将使用 dcast
而不是 cast
。 tidyr
,正如@Maurits Evers 上面的解决方案是一个更好的解决方案,大多数开发已经从 reshape2
转移到 tidyverse