Reshape from wide to long with groups of variables

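This question is very similar to an existing question, but I could not extend that approach to multiple groups of variables. Here is the dataset I am working with: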

A tibble: 12 x 9
   Month Cabo_BU_PCT Acapulco_BU_PCT Cabo_LOS_AVG Acapulco_LOS_AVG BED_BUGS_Cabo BED_BUGS_Acapulco TOTAL_OCCUPIED_Cabo TOTAL_OCCUPIED_Acapulco

       1   0.6470034       0.6260116     5.223000         4.307667             5                 3               19216                    6498
       2   0.6167027       0.6777457     5.893571         4.247500             3                 0               17095                    6566
       3   0.6372108       0.6348126     5.229677         4.327742             5                 1               19556                    6809
       4   0.6357912       0.6548170     5.356667         4.220000             4                 6               18883                    6797
       5   0.6449006       0.6409659     5.344194         4.162903             2                 5               19792                    6875
       6   0.6747811       0.6935453     5.812667         4.362000             4                 3               20041                    7199
       7   0.6697947       0.6932687     5.544516         4.462903             5                 6               20556                    7436
       8   0.6595960       0.6777923     5.260323         4.135806             0                 7               20243                    7270
       9   0.6792256       0.6863198     5.424333         4.133333             5                 0               20173                    7124
      10   0.6976214       0.7370875     5.419677         4.350000             3                 3               21410                    7906
      11   0.6600337       0.6615607     5.450000         4.184333             3                 2               19603                    6867
      12   0.6761812       0.6773261     5.347097         4.318710             2                 2               20752                    7265

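My goal is to reshape it into the long format shown below, where the columns Cabo_BU_PCT and Acapulco_BU_PCT are stacked under a single column named BU_PCT, the columns Cabo_LOS_AVG and Acapulco_LOS_AVG are stacked under LOS_AVG, and so on.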

  Month    Location    BU_PCT      LOS_AVG     BED_BUGS       TOTAL_OCCUPIED
  1        Cabo        0.6470034   5.223000    5              19216
  1        Acapulco    0.6260116   4.307667    3              6498
  2        Cabo        0.6167027   5.893571    3              17095
  2        Acapulco    0.6777457   4.247500    0              6566
  .
  .
  .
  12       Cabo        0.6761812   5.347097    2              20752
  12       Acapulco    0.6773261   4.318710    2              7265  

Any help with reshaping this data frame would be much appreciated. Thanks.

========Dataset===========

df_wide <- structure(list(Month = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
), Cabo_BU_PCT = c(0.647003367003367, 0.616702741702742, 0.637210817855979, 
0.635791245791246, 0.644900619094168, 0.674781144781145, 0.669794721407625, 
0.65959595959596, 0.679225589225589, 0.69762137504073, 0.66003367003367, 
0.676181166503747), Acapulco_BU_PCT = c(0.626011560693642, 0.677745664739884, 
0.634812604885325, 0.654816955684008, 0.640965877307477, 0.69354527938343, 
0.693268692895767, 0.677792280440052, 0.686319845857418, 0.737087451053515, 
0.661560693641619, 0.677326123438374), Cabo_LOS_AVG = c(5.223, 
5.89357142857143, 5.22967741935484, 5.35666666666667, 5.3441935483871, 
5.81266666666667, 5.54451612903226, 5.26032258064516, 5.42433333333333, 
5.41967741935484, 5.45, 5.34709677419355), Acapulco_LOS_AVG = c(4.30766666666667, 
4.2475, 4.32774193548387, 4.22, 4.16290322580645, 4.362, 4.46290322580645, 
4.1358064516129, 4.13333333333333, 4.35, 4.18433333333333, 4.31870967741935
), BED_BUGS_Cabo = c(5, 3, 5, 4, 2, 4, 5, 0, 5, 3, 3, 2), BED_BUGS_Acapulco = c(3, 
0, 1, 6, 5, 3, 6, 7, 0, 3, 2, 2), TOTAL_OCCUPIED_Cabo = c(19216, 
17095, 19556, 18883, 19792, 20041, 20556, 20243, 20173, 21410, 
19603, 20752), TOTAL_OCCUPIED_Acapulco = c(6498, 6566, 6809, 
6797, 6875, 7199, 7436, 7270, 7124, 7906, 6867, 7265)), class = c("tbl_df", 
"tbl", "data.frame"), .Names = c("Month", "Cabo_BU_PCT", "Acapulco_BU_PCT", 
"Cabo_LOS_AVG", "Acapulco_LOS_AVG", "BED_BUGS_Cabo", "BED_BUGS_Acapulco", 
"TOTAL_OCCUPIED_Cabo", "TOTAL_OCCUPIED_Acapulco"), row.names = c(NA, 
-12L))

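If you only have two locations, you can put them in a regular expression, allowing for the fact that they may appear at either the start or the end of a column name: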

library(tidyverse)

df_wide %>% 
    gather(variable, value, -Month) %>% 
    mutate(location = sub('.*(Cabo|Acapulco).*', '\\1', variable), 
           variable = sub('_?(Cabo|Acapulco)_?', '', variable)) %>% 
    spread(variable, value)
#> # A tibble: 24 x 6
#>    Month location BED_BUGS    BU_PCT  LOS_AVG TOTAL_OCCUPIED
#>  * <dbl>    <chr>    <dbl>     <dbl>    <dbl>          <dbl>
#>  1     1 Acapulco        3 0.6260116 4.307667           6498
#>  2     1     Cabo        5 0.6470034 5.223000          19216
#>  3     2 Acapulco        0 0.6777457 4.247500           6566
#>  4     2     Cabo        3 0.6167027 5.893571          17095
#>  5     3 Acapulco        1 0.6348126 4.327742           6809
#>  6     3     Cabo        5 0.6372108 5.229677          19556
#>  7     4 Acapulco        6 0.6548170 4.220000           6797
#>  8     4     Cabo        4 0.6357912 5.356667          18883
#>  9     5 Acapulco        5 0.6409659 4.162903           6875
#> 10     5     Cabo        2 0.6449006 5.344194          19792
#> # ... with 14 more rows
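
To see what the two sub() calls do on their own, here is a minimal sketch that applies them to a few of the column names from the question. The vector cols and the result names location and variable are only illustrative; nothing beyond base R is assumed.

```r
# A few of the column names from the question (illustrative subset)
cols <- c("Cabo_BU_PCT", "Acapulco_LOS_AVG",
          "BED_BUGS_Cabo", "TOTAL_OCCUPIED_Acapulco")

# Replace the whole name with the captured location
location <- sub(".*(Cabo|Acapulco).*", "\\1", cols)

# Delete the location plus an adjacent underscore, leaving the variable name
variable <- sub("_?(Cabo|Acapulco)_?", "", cols)

print(data.frame(cols, location, variable))
```

Because the pattern allows an optional underscore on either side of the location, the same call handles both the prefix style (Cabo_BU_PCT) and the suffix style (BED_BUGS_Cabo).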

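This uses reshape from base R; no packages are needed. varying= specifies that columns 2 and 3 are to be combined, then columns 4 and 5, and so on. The names of the new columns are given in v.names= and the locations in times=.

We could derive the varying=, v.names= and times= arguments from the column headers, but given how irregular the headers are, that involves a messy regular expression, so it is simpler to write them out (although we show how to do it further below).

The result is sorted by location and then by month within location, but it can be reordered if needed.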

df_long <- reshape(df_wide, dir = "long", 
 varying = list(2:3, 4:5, 6:7, 8:9),
 v.names = c("BU_PCT", "LOS_AVG", "BED_BUGS", "TOTAL_OCCUPIED"),
 times = c("Cabo", "Acapulco"))[-7]
names(df_long)[2] <- "LOCATION"
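
If the Month-then-location ordering from the question is wanted, the long result can be re-sorted with order(). The following is a minimal sketch on a small made-up data frame (toy_wide and its values are hypothetical, not the question's data); it also assumes a plain data.frame rather than a tibble and uses timevar= to name the location column directly.

```r
# Toy wide data with hypothetical values, as a plain data.frame
toy_wide <- data.frame(
  Month = 1:3,
  Cabo_BU_PCT = c(0.65, 0.62, 0.64),
  Acapulco_BU_PCT = c(0.63, 0.68, 0.63),
  BED_BUGS_Cabo = c(5, 3, 5),
  BED_BUGS_Acapulco = c(3, 0, 1)
)

# Same idea as above, on two groups of columns
toy_long <- reshape(toy_wide, direction = "long",
  varying = list(2:3, 4:5),
  v.names = c("BU_PCT", "BED_BUGS"),
  timevar = "LOCATION",
  times = c("Cabo", "Acapulco"))

# Drop the id column that reshape adds
toy_long$id <- NULL

# Sort by Month, keeping Cabo before Acapulco within each month
loc_rank <- match(toy_long$LOCATION, c("Cabo", "Acapulco"))
toy_long <- toy_long[order(toy_long$Month, loc_rank), ]
rownames(toy_long) <- NULL

print(toy_long)
```

Using match() against an explicit vector of locations keeps the within-month order under your control instead of falling back to alphabetical order.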

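Alternatively, if we did want to derive varying=, v.names= and times= from names(df_wide), it could be done like this, where names1 is names(df_wide) without the first name (Month). We rely on the fact that each location name consists of lowercase letters apart from its first character, and that it either begins or ends each column name.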

names1 <- names(df_wide)[-1]
pat <- "(.[a-z]+)_(.*)|(.*)_(.[a-z]+)"
varying <- split(names1, sub(pat, "\\2\\3", names1))
v.names <- names(varying)
locations <- unique(sub(pat, "\\1\\4", names1))

df_long <- reshape(df_wide, dir = "long", varying = varying, v.names = v.names, 
     times = locations)[-7]
names(df_long)[2] <- "LOCATION"

Since spread and gather have been superseded by pivot_wider and pivot_longer, here is a version based on @alistaire's answer:

library(tidyverse)
df_wide %>%  
  pivot_longer(-Month)  %>%  
  mutate(location = sub('.*(Cabo|Acapulco).*', '\\1', name), 
         name  = sub('_?(Cabo|Acapulco)_?', '', name)) %>% 
  pivot_wider()
# ------ Outputs below ------
# A tibble: 24 × 6
   Month location BU_PCT LOS_AVG BED_BUGS TOTAL_OCCUPIED
   <dbl> <chr>     <dbl>   <dbl>    <dbl>          <dbl>
 1     1 Cabo      0.647    5.22        5          19216
 2     1 Acapulco  0.626    4.31        3           6498
 3     2 Cabo      0.617    5.89        3          17095
 4     2 Acapulco  0.678    4.25        0           6566
 5     3 Cabo      0.637    5.23        5          19556
 6     3 Acapulco  0.635    4.33        1           6809
 7     4 Cabo      0.636    5.36        4          18883
 8     4 Acapulco  0.655    4.22        6           6797
 9     5 Cabo      0.645    5.34        2          19792
10     5 Acapulco  0.641    4.16        5           6875
# … with 14 more rows
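
A possible variation, sketched here under assumptions rather than taken from the answers above: if the column names are first normalized so that the location always comes first, pivot_longer can do the whole reshape in one step through the special ".value" entry in names_to. The toy_wide data frame and its values are hypothetical, and the rename step assumes only the two locations Cabo and Acapulco.

```r
library(tidyr)

# Toy wide data with hypothetical values
toy_wide <- data.frame(
  Month = 1:2,
  Cabo_BU_PCT = c(0.65, 0.62),
  Acapulco_BU_PCT = c(0.63, 0.68),
  BED_BUGS_Cabo = c(5, 3),
  BED_BUGS_Acapulco = c(3, 0)
)

# Move a trailing location to the front so every name reads <location>_<variable>
names(toy_wide) <- sub("^(.*)_(Cabo|Acapulco)$", "\\2_\\1", names(toy_wide))

# ".value" tells pivot_longer that the second captured group names the output columns
toy_long <- pivot_longer(
  toy_wide,
  cols = -Month,
  names_to = c("location", ".value"),
  names_pattern = "^([^_]+)_(.*)$"
)

print(toy_long)
```

This avoids the intermediate name/value pair and the second pivot, at the cost of an explicit rename step up front.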