melt dataframe - multiple columns - "Enhanced (new) functionality from data.tables"

Update: I should have been clearer that I am trying to explore the enhanced functionality in reshaping with data.table https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html. Updated the title.

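I have this data set, which contains two sets of variables: Credit_Risk_Capital and Name_concentration. Each is calculated using 2 methodologies, new and old. When I melt them using the data.table package, the variable names default to 1 and 2. How can I change them to Credit_Risk_Capital and Name_Concentration?

Here is the data set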

    library(data.table)

    df <- data.table(
      id = 1:100,
      Credit_risk_Capital_old = rnorm(100, mean = 400, sd = 60),
      NameConcentration_old   = rnorm(100, mean = 100, sd = 10),
      Credit_risk_Capital_New = rnorm(100, mean = 200, sd = 10),
      NameConcentration_New   = rnorm(100, mean = 40, sd = 10))
    old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
    new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
    # melt both groups of columns at once into two value columns
    t1 <- melt(df, measure.vars = list(old, new),
               variable.name = "CapitalChargeType", value.name = c("old", "new"))

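Now, instead of the elements of the CapitalChargeType column being labelled 1 and 2, I want them changed to Credit_risk_Capital and NameConcentration. I can obviously change them in a subsequent step using the 'match' function, but is there any way I can do it within melt itself?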

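I'm not sure about doing this with melt, but here is a way using tidyr.

Note that I changed the variable names to use a . instead of a _ to separate the old/new part of the name. This makes it easier to split the name into two variables, since there are already many underscores.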

library(tidyr)

df <- tibble::tibble(   # data_frame() is deprecated in favour of tibble()
  id = c(1:100),
  Credit_risk_Capital.old= rnorm(100, mean = 400, sd = 60),
  NameConcentration.old= rnorm(100, mean = 100, sd = 10),
  Credit_risk_Capital.new =rnorm(100, mean = 200, sd = 10),
  NameConcentration.new = rnorm(100, mean = 40, sd = 10)
)

df %>% 
  gather("key", "value", -id) %>% 
  separate(key, c("CapitalChargeType", "new_old"), sep = "\\.") %>% 
  spread(new_old, value)

#> # A tibble: 200 x 4
#>       id   CapitalChargeType       new       old
#> *  <int>               <chr>     <dbl>     <dbl>
#> 1      1 Credit_risk_Capital 182.10955 405.78530
#> 2      1   NameConcentration  42.21037  99.44172
#> 3      2 Credit_risk_Capital 184.28810 370.14308
#> 4      2   NameConcentration  60.92340 120.13933
#> 5      3 Credit_risk_Capital 191.07982 389.50818
#> 6      3   NameConcentration  25.81776  90.91502
#> 7      4 Credit_risk_Capital 193.64247 327.56853
#> 8      4   NameConcentration  32.71050  94.95743
#> 9      5 Credit_risk_Capital 208.63547 286.59351
#> 10     5   NameConcentration  40.76064 116.52747
#> # ... with 190 more rows

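The issue here is that melt() doesn't know how to name the variables when there is more than one measure variable. So, it resorts to simply numbering them.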

It has been pointed out that there is a feature request for this. However, I will show two workarounds and compare them (plus the tidyr approach) in terms of speed.

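  1. The first approach is to melt() all measure variables (keeping the variable names), create new variable names, and then dcast() the temporary result again to end up with two value columns. The tidyr solution uses this recast approach as well.
  2. The second approach is what the OP asked for (melting two value columns simultaneously), but it includes a simple way to rename the variables afterwards.

Recast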

library(data.table)   # CRAN version 1.10.4 used
# melt all measure variables
long <- melt(df, id.vars = "id")
# split variables names
long[, c("CapitalChargeType", "age") := 
       tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)] 
dcast(long, id + CapitalChargeType ~ age)
      id   CapitalChargeType       New       old
  1:   1 Credit_risk_Capital 204.85227 327.57606
  2:   1   NameConcentration  34.20043 104.14524
  3:   2 Credit_risk_Capital 206.96769 416.64575
  4:   2   NameConcentration  30.46721  95.25282
  5:   3 Credit_risk_Capital 201.85514 465.06647
 ---                                            
196:  98   NameConcentration  45.38833  90.34097
197:  99 Credit_risk_Capital 203.53625 458.37501
198:  99   NameConcentration  40.14643 101.62655
199: 100 Credit_risk_Capital 203.19156 527.26703
200: 100   NameConcentration  30.83511  79.21762

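Note that each variable name is split at the last _, that is, the one directly before the trailing old or New. This is achieved by using a regular expression with positive look-ahead: "_(?=(New|old)$)"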
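To see the look-ahead in isolation, here is a minimal sketch that applies the same regular expression to the four column names with base R's strsplit() instead of tstrsplit(). The names cols, parts and split_names are only illustrative.

```r
# The four measure columns from the question
cols <- c("Credit_risk_Capital_old", "NameConcentration_old",
          "Credit_risk_Capital_New", "NameConcentration_New")

# perl = TRUE enables look-ahead: only the "_" is consumed as the separator,
# while "(New|old)$" is merely inspected and stays in the second piece
parts <- strsplit(cols, "_(?=(New|old)$)", perl = TRUE)

# Bind the pieces into a two-column character matrix
split_names <- do.call(rbind, parts)
colnames(split_names) <- c("CapitalChargeType", "age")
split_names
```

The earlier underscores in Credit_risk_Capital are left alone because they are not followed by New or old at the end of the string.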

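Melting two columns and renaming the variables

Here, we make use of the patterns() function, which is equivalent to specifying a list of measure variables.

As a side note: the order of the list (or of the patterns) determines the order of the value columns: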

melt(df, measure.vars = patterns("New$", "old$"))
      id variable    value1    value2
  1:   1        1 204.85227 327.57606
  2:   2        1 206.96769 416.64575
  3:   3        1 201.85514 465.06647
  ...
melt(df, measure.vars = patterns("old$", "New$"))
      id variable    value1    value2
  1:   1        1 327.57606 204.85227
  2:   2        1 416.64575 206.96769
  3:   3        1 465.06647 201.85514
  ...

As the OP has already pointed out, melting with multiple measure variables

long <- melt(df, measure.vars = patterns("old$", "New$"), 
     variable.name = "CapitalChargeType",
     value.name = c("old", "New")) 

returns numbers instead of the variable names:

str(long)
Classes ‘data.table’ and 'data.frame':    200 obs. of  4 variables:
 $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CapitalChargeType: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ old              : num  328 417 465 259 426 ...
 $ New              : num  205 207 202 207 203 ...
 - attr(*, ".internal.selfref")=<externalptr>

Fortunately, these are factors, so they can easily be changed by replacing the factor levels with the help of the forcats package:

long[, CapitalChargeType := forcats::lvls_revalue(
  CapitalChargeType, 
  c("Credit_risk_Capital", "NameConcentration"))]
long[order(id)]
      id   CapitalChargeType       old       New
  1:   1 Credit_risk_Capital 327.57606 204.85227
  2:   1   NameConcentration 104.14524  34.20043
  3:   2 Credit_risk_Capital 416.64575 206.96769
  4:   2   NameConcentration  95.25282  30.46721
  5:   3 Credit_risk_Capital 465.06647 201.85514
 ---                                            
196:  98   NameConcentration  90.34097  45.38833
197:  99 Credit_risk_Capital 458.37501 203.53625
198:  99   NameConcentration 101.62655  40.14643
199: 100 Credit_risk_Capital 527.26703 203.19156
200: 100   NameConcentration  79.21762  30.83511

Note that melt() numbers the variables in the order in which the columns appear in df.
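Because the numbering follows the column order, the labels could also be derived from the column names instead of being typed in by hand. The following is only a sketch under a few assumptions: it uses a small 3-row stand-in for df, assumes every old column ends in _old, and swaps forcats::lvls_revalue() for base R's factor(); the helper names old_cols and labels are illustrative.

```r
library(data.table)

# Small stand-in for the question's data (3 rows instead of 100)
set.seed(1L)
df <- data.table(
  id = 1:3,
  Credit_risk_Capital_old = rnorm(3, mean = 400, sd = 60),
  NameConcentration_old   = rnorm(3, mean = 100, sd = 10),
  Credit_risk_Capital_New = rnorm(3, mean = 200, sd = 10),
  NameConcentration_New   = rnorm(3, mean = 40, sd = 10))

long <- melt(df, measure.vars = patterns("old$", "New$"),
             variable.name = "CapitalChargeType",
             value.name = c("old", "New"))

# Columns matched by the first pattern, in the order they appear in df
old_cols <- grep("old$", names(df), value = TRUE)
# Strip the suffix to obtain one label per factor level "1", "2", ...
labels <- sub("_old$", "", old_cols)

# Relabel the levels in place
long[, CapitalChargeType := factor(CapitalChargeType, labels = labels)]
long[order(id)]
```

This only works if the old and New columns appear in the same relative order in df, since melt() pairs them by position.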

reshape()

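Base R's stats package has a reshape() function. Unfortunately, it doesn't accept regular expressions with positive look-ahead, so its automatic guessing of variable names can't be used here. Instead, all relevant parameters have to be specified explicitly: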

old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
reshape(df, varying = list(old, new), direction = "long", 
        timevar = "CapitalChargeType",
        times = c("Credit_risk_Capital", "NameConcentration"),
        v.names = c("old", "New"))
      id   CapitalChargeType       old       New
  1:   1 Credit_risk_Capital 367.95567 194.93598
  2:   2 Credit_risk_Capital 467.98061 215.39663
  3:   3 Credit_risk_Capital 363.75586 201.72794
  4:   4 Credit_risk_Capital 433.45070 191.64176
  5:   5 Credit_risk_Capital 408.55776 193.44071
 ---                                            
196:  96   NameConcentration  93.67931  47.85263
197:  97   NameConcentration 101.32361  46.94047
198:  98   NameConcentration 104.80926  33.67270
199:  99   NameConcentration 101.33178  32.28041
200: 100   NameConcentration  85.37136  63.57817

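Benchmarking

The benchmark includes all 4 approaches discussed so far:

  • tidyr, modified to use the look-ahead regular expression,
  • recast,
  • melt() of multiple value variables, and
  • reshape().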

The benchmark data consist of 100,000 rows:

n_rows <- 100000L
set.seed(1234L)
df <- data.table(
  id = c(1:n_rows),
  Credit_risk_Capital_old = rnorm(n_rows, mean = 400, sd = 60),
  NameConcentration_old = rnorm(n_rows, mean = 100, sd = 10),
  Credit_risk_Capital_New = rnorm(n_rows, mean = 200, sd = 10),
  NameConcentration_New = rnorm(n_rows, mean = 40, sd = 10))

For benchmarking, the microbenchmark package is used:

library(magrittr)
old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
microbenchmark::microbenchmark(
  tidyr = {
    r_tidyr <- df %>% 
      dplyr::as_data_frame() %>%  
      tidyr::gather("key", "value", -id) %>% 
      tidyr::separate(key, c("CapitalChargeType", "age"), sep = "_(?=(New|old)$)") %>% 
      tidyr::spread(age, value)
  },
  recast = {
    r_recast <- dcast(
      melt(df, id.vars = "id")[
        , c("CapitalChargeType", "age") := 
          tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)], 
      id + CapitalChargeType ~ age)
  },
  m2col = {
    r_m2col <- melt(df, measure.vars = patterns("New$", "old$"), 
                    variable.name = "CapitalChargeType",
                    value.name = c("New", "old"))[
                      , CapitalChargeType := forcats::lvls_revalue(
                        CapitalChargeType, 
                        c("Credit_risk_Capital", "NameConcentration"))][order(id)]
  },
  reshape = {
    r_reshape <- reshape(df, varying = list(new, old), direction = "long", 
                         timevar = "CapitalChargeType",
                         times = c("Credit_risk_Capital", "NameConcentration"),
                         v.names = c("New", "old")
    )
  },
  times = 10L
)
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval
   tidyr 705.20364 789.63010 832.11391 813.08830 825.15259 1091.3188    10
  recast 215.35813 223.60715 287.28034 261.23333 338.36813  477.3355    10
   m2col  10.28721  11.35237  38.72393  14.46307  23.64113  154.3357    10
 reshape 143.75546 171.68592 379.05752 224.13671 269.95301 1730.5892    10

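The median timings show that melt() of two columns simultaneously is about 15 times faster than the second fastest approach, reshape(). Both recast variants fall behind because each requires two reshaping operations. The tidyr solution is particularly slow.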
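The factor of roughly 15 can be checked directly from the median column of the table above. This small sketch just copies the four medians (in milliseconds) and divides each by the fastest one; the vector name medians_ms is illustrative.

```r
# Median timings in milliseconds, copied from the microbenchmark output above
medians_ms <- c(tidyr = 813.08830, recast = 261.23333,
                m2col = 14.46307, reshape = 224.13671)

# How many times slower each approach is than melt() of two columns (m2col)
slowdown <- medians_ms / medians_ms[["m2col"]]
round(slowdown, 1)
```

Since these numbers come from only 10 runs on one machine, the ratios should be read as rough orders of magnitude rather than exact figures.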

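Although this question is old, an updated answer may help those who are directed here by a search. In the most recent development version of data.table, melt has a new measure function, with which you can do: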

df <-data.table(
  id = c(1:100),
  Credit_risk_Capital_old= rnorm(100, mean = 400, sd = 60),
  NameConcentration_old= rnorm(100, mean = 100, sd = 10),
  Credit_risk_Capital_New =rnorm(100, mean = 200, sd = 10),
  NameConcentration_New = rnorm(100, mean = 40, sd = 10)
)

melt(df,
     id.vars = "id",
     measure(CapitalChargeType, value.name,
             pattern = "(.*)_(New|old)"))

which gives the output:

        id   CapitalChargeType       old       New
     <int>              <char>     <num>     <num>
  1:     1 Credit_risk_Capital 409.89004 210.30058
  2:     2 Credit_risk_Capital 403.15172 197.26172
  3:     3 Credit_risk_Capital 374.90492 192.21152
  4:     4 Credit_risk_Capital 509.17491 195.39095
  5:     5 Credit_risk_Capital 429.48302 197.44441
 ---                                              
196:    96   NameConcentration  80.64747  37.61926
197:    97   NameConcentration 104.39483  13.86576
198:    98   NameConcentration 106.87475  23.15775
199:    99   NameConcentration 112.92373  44.51562
200:   100   NameConcentration 111.80915  38.40075
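To illustrate how the pattern relates to this output, the sketch below extracts the two capture groups from the column names with base R's regexec() and regmatches(). It does not call measure() itself; it only shows what each group matches. The first group supplies the values of the CapitalChargeType column, and the second group, labelled value.name, supplies the names of the value columns (old and New). The names cols, m and groups are illustrative.

```r
cols <- c("Credit_risk_Capital_old", "NameConcentration_old",
          "Credit_risk_Capital_New", "NameConcentration_New")

# regexec() returns the full match followed by one entry per capture group
m <- regmatches(cols, regexec("(.*)_(New|old)", cols))

# Drop the full match (column 1) and keep the two groups
groups <- do.call(rbind, m)[, 2:3, drop = FALSE]
colnames(groups) <- c("CapitalChargeType", "value.name")
groups
```

Because (.*) is greedy, it swallows the earlier underscores, so the split happens at the last _ just as with the look-ahead expression used in the other answer.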

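The new version should appear on CRAN at some point, but until then you can use the development version. I will try to update this answer when that version moves to CRAN.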