Reshape from wide to long with groups of variables
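This question is very similar to an existing one, but I have not been able to extend that approach to multiple groups of variables. Here is the dataset I am working with: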
A tibble: 12 x 9
Month Cabo_BU_PCT Acapulco_BU_PCT Cabo_LOS_AVG Acapulco_LOS_AVG BED_BUGS_Cabo BED_BUGS_Acapulco TOTAL_OCCUPIED_Cabo TOTAL_OCCUPIED_Acapulco
1 0.6470034 0.6260116 5.223000 4.307667 5 3 19216 6498
2 0.6167027 0.6777457 5.893571 4.247500 3 0 17095 6566
3 0.6372108 0.6348126 5.229677 4.327742 5 1 19556 6809
4 0.6357912 0.6548170 5.356667 4.220000 4 6 18883 6797
5 0.6449006 0.6409659 5.344194 4.162903 2 5 19792 6875
6 0.6747811 0.6935453 5.812667 4.362000 4 3 20041 7199
7 0.6697947 0.6932687 5.544516 4.462903 5 6 20556 7436
8 0.6595960 0.6777923 5.260323 4.135806 0 7 20243 7270
9 0.6792256 0.6863198 5.424333 4.133333 5 0 20173 7124
10 0.6976214 0.7370875 5.419677 4.350000 3 3 21410 7906
11 0.6600337 0.6615607 5.450000 4.184333 3 2 19603 6867
12 0.6761812 0.6773261 5.347097 4.318710 2 2 20752 7265
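My goal is to reshape it into the long format shown below, where the columns Cabo_BU_PCT and Acapulco_BU_PCT are stacked under a single column named BU_PCT, the columns Cabo_LOS_AVG and Acapulco_LOS_AVG are stacked under LOS_AVG, and so on.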
Month Location BU_PCT LOS_AVG BED_BUGS TOTAL_OCCUPIED
1 Cabo 0.6470034 5.223000 5 19216
1 Acapulco 0.6260116 4.307667 3 6498
2 Cabo 0.6167027 5.893571 3 17095
2 Acapulco 0.6777457 4.247500 0 6566
.
.
.
12 Cabo 0.6761812 5.347097 2 20752
12 Acapulco 0.6773261 4.318710 2 7265
Any help with reshaping this data frame is much appreciated. Thanks.
======== Dataset ===========
df_wide <- structure(list(Month = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
), Cabo_BU_PCT = c(0.647003367003367, 0.616702741702742, 0.637210817855979,
0.635791245791246, 0.644900619094168, 0.674781144781145, 0.669794721407625,
0.65959595959596, 0.679225589225589, 0.69762137504073, 0.66003367003367,
0.676181166503747), Acapulco_BU_PCT = c(0.626011560693642, 0.677745664739884,
0.634812604885325, 0.654816955684008, 0.640965877307477, 0.69354527938343,
0.693268692895767, 0.677792280440052, 0.686319845857418, 0.737087451053515,
0.661560693641619, 0.677326123438374), Cabo_LOS_AVG = c(5.223,
5.89357142857143, 5.22967741935484, 5.35666666666667, 5.3441935483871,
5.81266666666667, 5.54451612903226, 5.26032258064516, 5.42433333333333,
5.41967741935484, 5.45, 5.34709677419355), Acapulco_LOS_AVG = c(4.30766666666667,
4.2475, 4.32774193548387, 4.22, 4.16290322580645, 4.362, 4.46290322580645,
4.1358064516129, 4.13333333333333, 4.35, 4.18433333333333, 4.31870967741935
), BED_BUGS_Cabo = c(5, 3, 5, 4, 2, 4, 5, 0, 5, 3, 3, 2), BED_BUGS_Acapulco = c(3,
0, 1, 6, 5, 3, 6, 7, 0, 3, 2, 2), TOTAL_OCCUPIED_Cabo = c(19216,
17095, 19556, 18883, 19792, 20041, 20556, 20243, 20173, 21410,
19603, 20752), TOTAL_OCCUPIED_Acapulco = c(6498, 6566, 6809,
6797, 6875, 7199, 7436, 7270, 7124, 7906, 6867, 7265)), class = c("tbl_df",
"tbl", "data.frame"), .Names = c("Month", "Cabo_BU_PCT", "Acapulco_BU_PCT",
"Cabo_LOS_AVG", "Acapulco_LOS_AVG", "BED_BUGS_Cabo", "BED_BUGS_Acapulco",
"TOTAL_OCCUPIED_Cabo", "TOTAL_OCCUPIED_Acapulco"), row.names = c(NA,
-12L))
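If you only have two locations, you can put them in a regular expression, allowing for the fact that they may appear at the start or at the end of the column name: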
library(tidyverse)
df_wide %>%
    gather(variable, value, -Month) %>%
    # pull the location out of each name, then strip it (and its underscore)
    mutate(location = sub('.*(Cabo|Acapulco).*', '\\1', variable),
           variable = sub('_?(Cabo|Acapulco)_?', '', variable)) %>%
    spread(variable, value)
#> # A tibble: 24 x 6
#> Month location BED_BUGS BU_PCT LOS_AVG TOTAL_OCCUPIED
#> * <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Acapulco 3 0.6260116 4.307667 6498
#> 2 1 Cabo 5 0.6470034 5.223000 19216
#> 3 2 Acapulco 0 0.6777457 4.247500 6566
#> 4 2 Cabo 3 0.6167027 5.893571 17095
#> 5 3 Acapulco 1 0.6348126 4.327742 6809
#> 6 3 Cabo 5 0.6372108 5.229677 19556
#> 7 4 Acapulco 6 0.6548170 4.220000 6797
#> 8 4 Cabo 4 0.6357912 5.356667 18883
#> 9 5 Acapulco 5 0.6409659 4.162903 6875
#> 10 5 Cabo 2 0.6449006 5.344194 19792
#> # ... with 14 more rows
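To see what the two `sub()` calls do before any reshaping happens, they can be run on the column names alone. This is only a sketch in base R: the vector `locations` and the helper `alt` are illustrative names that are not part of the answer, and building the alternation from a vector is one assumed way to cope with more than two locations.

```r
# Column names from the question (Month is left out because it is the id column).
cols <- c("Cabo_BU_PCT", "Acapulco_BU_PCT", "Cabo_LOS_AVG", "Acapulco_LOS_AVG",
          "BED_BUGS_Cabo", "BED_BUGS_Acapulco",
          "TOTAL_OCCUPIED_Cabo", "TOTAL_OCCUPIED_Acapulco")

# Hypothetical helper: list each location once and build "Cabo|Acapulco".
locations <- c("Cabo", "Acapulco")
alt <- paste(locations, collapse = "|")

# Same two substitutions as in the answer, with the alternation pasted in.
location <- sub(paste0(".*(", alt, ").*"), "\\1", cols)  # keep only the location
variable <- sub(paste0("_?(", alt, ")_?"), "", cols)     # drop location and its underscore

print(data.frame(cols, location, variable))
```

Each name maps to one location and one measure regardless of whether the location is a prefix or a suffix, which is what lets the later spread step line the values up.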
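This uses reshape from base R; no packages are needed. varying= specifies that columns 2 and 3 are to be combined, then columns 4 and 5, and so on. The names of the new columns are given in v.names= and the locations in times=.
We could derive the varying=, v.names= and times= arguments from the column names, but because the names are irregular this requires a messy regular expression, so it is simpler to write them out (how to derive them is nevertheless shown further below).
The result is sorted by location and then by month within location, but it can be reordered if needed.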
df_long <- reshape(df_wide, dir = "long",
                   varying = list(2:3, 4:5, 6:7, 8:9),
                   v.names = c("BU_PCT", "LOS_AVG", "BED_BUGS", "TOTAL_OCCUPIED"),
                   times = c("Cabo", "Acapulco"))[-7]  # drop the id column that reshape adds
names(df_long)[2] <- "LOCATION"  # rename the default "time" column
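The reordering mentioned above could look like the following. It is a minimal sketch, assuming a two-month excerpt of the question's data with rounded values; `df_small` and `long_small` are illustrative names, and sorting Cabo before Acapulco within each month is a choice made here to match the layout requested in the question.

```r
# Two-month excerpt of the wide data (values rounded for brevity).
df_small <- data.frame(
  Month = c(1, 2),
  Cabo_BU_PCT = c(0.647, 0.617), Acapulco_BU_PCT = c(0.626, 0.678),
  Cabo_LOS_AVG = c(5.22, 5.89), Acapulco_LOS_AVG = c(4.31, 4.25),
  BED_BUGS_Cabo = c(5, 3), BED_BUGS_Acapulco = c(3, 0),
  TOTAL_OCCUPIED_Cabo = c(19216, 17095), TOTAL_OCCUPIED_Acapulco = c(6498, 6566)
)

loc_order <- c("Cabo", "Acapulco")

# Same reshape call as in the answer, applied to the excerpt.
long_small <- reshape(df_small, dir = "long",
                      varying = list(2:3, 4:5, 6:7, 8:9),
                      v.names = c("BU_PCT", "LOS_AVG", "BED_BUGS", "TOTAL_OCCUPIED"),
                      times = loc_order)[-7]
names(long_small)[2] <- "LOCATION"

# Sort by month first, then by location in the order given in loc_order.
long_small <- long_small[order(long_small$Month,
                               match(long_small$LOCATION, loc_order)), ]
rownames(long_small) <- NULL  # discard the "1.Cabo"-style row names
print(long_small)
```

Using `match()` against `loc_order` avoids an alphabetical sort, which would put Acapulco first.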
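Alternatively, if we do want to derive varying=, v.names= and times= from names(df_wide), it can be done as follows, where names1 is names(df_wide) without its first element (Month). We rely on the fact that each location name consists of lowercase letters apart from its first character, and that it either starts or ends each column name.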
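names1 <- names(df_wide)[-1]
# either location_measure (groups 1 and 2) or measure_location (groups 3 and 4)
pat <- "(.[a-z]+)_(.*)|(.*)_(.[a-z]+)"
# group the column names by measure; split() orders the groups alphabetically
varying <- split(names1, sub(pat, "\\2\\3", names1))
v.names <- names(varying)
locations <- unique(sub(pat, "\\1\\4", names1))
df_long <- reshape(df_wide, dir = "long", varying = varying, v.names = v.names,
                   times = locations)[-7]  # drop the id column that reshape adds
names(df_long)[2] <- "LOCATION"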
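It can help to inspect the intermediate objects before passing them to reshape. The sketch below runs the same pattern on the column names only; `measure` and `place` are illustrative names introduced here, and the behaviour of `[a-z]` is assumed to be the usual one (lowercase only), which can depend on the locale.

```r
names1 <- c("Cabo_BU_PCT", "Acapulco_BU_PCT", "Cabo_LOS_AVG", "Acapulco_LOS_AVG",
            "BED_BUGS_Cabo", "BED_BUGS_Acapulco",
            "TOTAL_OCCUPIED_Cabo", "TOTAL_OCCUPIED_Acapulco")

pat <- "(.[a-z]+)_(.*)|(.*)_(.[a-z]+)"

# Only one alternative matches each name, so the groups of the other
# alternative are empty and "\\2\\3" (or "\\1\\4") yields a single piece.
measure <- sub(pat, "\\2\\3", names1)
place   <- sub(pat, "\\1\\4", names1)

varying <- split(names1, measure)  # list of column-name pairs, one per measure
print(varying)
print(unique(place))
```

Within each element of `varying` the Cabo column comes before the Acapulco column, which is the same order as `unique(place)`, so the values line up with times=.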
Since spread and gather have been superseded by pivot_wider and pivot_longer, here is an updated version based on @alistaire's answer:
library(tidyverse)
df_wide %>%
    pivot_longer(-Month) %>%
    # pull the location out of each name, then strip it (and its underscore)
    mutate(location = sub('.*(Cabo|Acapulco).*', '\\1', name),
           name = sub('_?(Cabo|Acapulco)_?', '', name)) %>%
    pivot_wider()
# ------ Outputs below ------
# A tibble: 24 × 6
Month location BU_PCT LOS_AVG BED_BUGS TOTAL_OCCUPIED
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Cabo 0.647 5.22 5 19216
2 1 Acapulco 0.626 4.31 3 6498
3 2 Cabo 0.617 5.89 3 17095
4 2 Acapulco 0.678 4.25 0 6566
5 3 Cabo 0.637 5.23 5 19556
6 3 Acapulco 0.635 4.33 1 6809
7 4 Cabo 0.636 5.36 4 18883
8 4 Acapulco 0.655 4.22 6 6797
9 5 Cabo 0.645 5.34 2 19792
10 5 Acapulco 0.641 4.16 5 6875
# … with 14 more rows