Make dataframe wide with months (columns) and periods (rows)
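I have a data frame with several month columns plus a total (13 columns altogether) and two periods (half-years, labelled 1 and 7).
I am trying to spread it wide so that the columns read January1 and January7 (for all 12 months, plus Total1 and Total7). After that I will calculate the differences between the periods.
Please advise how to do this.
I tried spread(), the old Hadley function, but having several month columns complicates the key/value setup. I also failed with pivot_wider().
I obviously went through multiple questions, for example this one.
A sample of my data is here.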
structure(list(Year = c(2019, 2019, 2019, 2019, 2019), Period = c(1,
1, 7, 1, 7), KPKV = c(99999, 110000, 111000, 111010, 111010),
KEKV = c(4, 1, 1, 2, 2), Name = c("A", "B", "B", "B", "B"
), January = c(70198346.4, 125181.4, 125181.4, 64008.4, 34374.1
), February = c(71052496.2, 127697.1, 127697.1, 66007.3,
34719.1), March = c(96884031.5, 142375.3, 142375.3, 75510.2,
38082.1), April = c(74389605.4, 139627.8, 139627.8, 75891.9,
37262.5), May = c(101876908, 144649.4, 144649.4, 79889.6,
41150), June = c(86362730.8, 178706.8, 178706.8, 96616, 49727.9
), July = c(74326532.8, 178708.4, 178708.4, 96616, 55955.7
), August = c(80052666.3, 186225.8, 186225.8, 102606.5, 30816.8
), September = c(90236044.8, 182131, 182131, 102885.7, 49123.1
), October = c(79077964, 175287.8, 175287.8, 101166.1, 49942.8
), November = c(92509081.2, 185182.1, 185182.1, 109051.8,
37609.2), December = c(88801141.2, 198270.2, 198270.2, 119648,
37609.2), Total = c(1005767549, 1964043.1, 1964043.1, 1089897.5,
496372.5)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), spec = structure(list(cols = list(
Year = structure(list(), class = c("collector_double", "collector"
)), Period = structure(list(), class = c("collector_double",
"collector")), KPKV = structure(list(), class = c("collector_double",
"collector")), KEKV = structure(list(), class = c("collector_double",
"collector")), Name = structure(list(), class = c("collector_character",
"collector")), January = structure(list(), class = c("collector_double",
"collector")), February = structure(list(), class = c("collector_double",
"collector")), March = structure(list(), class = c("collector_double",
"collector")), April = structure(list(), class = c("collector_double",
"collector")), May = structure(list(), class = c("collector_double",
"collector")), June = structure(list(), class = c("collector_double",
"collector")), July = structure(list(), class = c("collector_double",
"collector")), August = structure(list(), class = c("collector_double",
"collector")), September = structure(list(), class = c("collector_double",
"collector")), October = structure(list(), class = c("collector_double",
"collector")), November = structure(list(), class = c("collector_double",
"collector")), December = structure(list(), class = c("collector_double",
"collector")), Total = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
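Update:
After applying the first solution the data was reshaped, but not everything is right: some columns are lost.
I believe this is because the KPKV column is unique, but the KEKV column can have several values under the same KPKV.
My expected output: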
structure(list(Year = 2019, KPKV = 99999, KEKV = 4, Name = "Random name",
April1 = 74389605.4, April7 = NA_real_, August1 = 80052666.3,
August7 = NA_real_, December1 = 88801141.2, December7 = NA_real_,
February1 = 71052496.2, February7 = NA_real_, January1 = 70198346.4,
January7 = NA_real_, July1 = 74326532.8, July7 = NA_real_,
June1 = 86362730.8, June7 = NA_real_, March1 = 96884031.5,
March7 = NA_real_, May1 = 101876908, May7 = NA_real_, November1 =
92509081.2,
November7 = NA_real_, October1 = 79077964, October7 = NA_real_,
September1 = 90236044.8, September7 = NA_real_, Total1 = 1005767548.6,
Total7 = NA_real_), row.names = 1L, class = "data.frame")
Using the base R reshape function:
reshape(data.frame(df),idvar = "Name",timevar = "Period",dir="wide",sep="")
Name Year1 KPKV1 KEKV1 January1 February1 March1 April1 May1 June1 July1 August1 September1
1 A 2019 110000 1 70198346.4 71052496.2 96884031.5 74389605.4 101876908.0 86362730.8 74326532.8 80052666.3 90236045
2 B 2019 110000 1 125181.4 127697.1 142375.3 139627.8 144649.4 178706.8 178708.4 186225.8 182131
October1 November1 December1 Total1 Year7 KPKV7 KEKV7 January7 February7 March7 April7 May7 June7
1 79077964.0 92509081.2 88801141.2 1005767549 2019 111000 1 125181.4 127697.1 142375.3 139627.8 144649.4 178706.8
2 175287.8 185182.1 198270.2 1964043 2019 111010 1 64008.4 66007.3 75510.2 75891.9 79889.6 96616.0
July7 August7 September7 October7 November7 December7 Total7
1 178708.4 186225.8 182131.0 175287.8 185182.1 198270.2 1964043
2 96616.0 102606.5 102885.7 101166.1 109051.8 119648.0 1089898
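The question also asks for the differences between the two periods. Here is a minimal base R sketch of that step. It assumes a shortened toy copy of the sample data (two months plus Total) and that all four identifier columns (Year, KPKV, KEKV, Name) are passed as idvar, so rows with different KPKV/KEKV are not collapsed. The names wide, value_cols, and the *Diff columns are made up for illustration.

```r
# Toy stand-in for the question's data: only January, February and Total are kept
df <- data.frame(
  Year     = c(2019, 2019, 2019, 2019, 2019),
  Period   = c(1, 1, 7, 1, 7),
  KPKV     = c(99999, 110000, 111000, 111010, 111010),
  KEKV     = c(4, 1, 1, 2, 2),
  Name     = c("A", "B", "B", "B", "B"),
  January  = c(70198346.4, 125181.4, 125181.4, 64008.4, 34374.1),
  February = c(71052496.2, 127697.1, 127697.1, 66007.3, 34719.1),
  Total    = c(1005767549, 1964043.1, 1964043.1, 1089897.5, 496372.5)
)

# Reshape to wide, one row per Year/KPKV/KEKV/Name combination
wide <- reshape(
  df,
  idvar     = c("Year", "KPKV", "KEKV", "Name"),
  timevar   = "Period",
  direction = "wide",
  sep       = ""
)

# Period 7 minus period 1 for every value column; NA where a period is missing
value_cols <- c("January", "February", "Total")
for (m in value_cols) {
  wide[[paste0(m, "Diff")]] <- wide[[paste0(m, "7")]] - wide[[paste0(m, "1")]]
}

wide
```

Rows that only have one of the two periods get NA in the other period's columns, so their difference is NA as well.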
One option is pivot_wider from the development version of tidyr:
library(tidyr) #‘0.8.3.9000’
library(dplyr)
df1 %>%
pivot_wider(id_cols = Name, names_from = Period,
values_from = c(January:December), names_sep = "")
# A tibble: 2 x 25
# Name January1 January7 February1 February7 March1 March7 April1 April7 May1 May7 June1 June7 July1 July7 August1 August7 September1 September7
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 7.02e7 125181. 71052496. 127697. 9.69e7 1.42e5 7.44e7 1.40e5 1.02e8 1.45e5 8.64e7 1.79e5 7.43e7 1.79e5 8.01e7 186226. 90236045. 182131
#2 B 1.25e5 64008. 127697. 66007. 1.42e5 7.55e4 1.40e5 7.59e4 1.45e5 7.99e4 1.79e5 9.66e4 1.79e5 9.66e4 1.86e5 102606. 182131 102886.
# … with 6 more variables: October1 <dbl>, October7 <dbl>, November1 <dbl>, November7 <dbl>, December1 <dbl>, December7 <dbl>
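Regarding the update in the question (columns going missing): with id_cols = Name, every column that is neither an id nor a value column is dropped, and rows sharing a Name are treated as the same unit. A possible variant, sketched here on a shortened toy copy of the sample data (two months plus Total; the object name wide is made up), keeps Year, KPKV, KEKV and Name as identifiers and includes Total among the value columns. It assumes a released tidyr (1.0.0 or later), where pivot_wider is available.

```r
library(tidyr)

# Toy stand-in for the question's data: only January, February and Total are kept
df1 <- data.frame(
  Year     = c(2019, 2019, 2019, 2019, 2019),
  Period   = c(1, 1, 7, 1, 7),
  KPKV     = c(99999, 110000, 111000, 111010, 111010),
  KEKV     = c(4, 1, 1, 2, 2),
  Name     = c("A", "B", "B", "B", "B"),
  January  = c(70198346.4, 125181.4, 125181.4, 64008.4, 34374.1),
  February = c(71052496.2, 127697.1, 127697.1, 66007.3, 34719.1),
  Total    = c(1005767549, 1964043.1, 1964043.1, 1089897.5, 496372.5)
)

# One row per Year/KPKV/KEKV/Name; value columns become January1, January7, ...
wide <- pivot_wider(
  df1,
  id_cols     = c(Year, KPKV, KEKV, Name),
  names_from  = Period,
  values_from = January:Total,
  names_sep   = ""
)

wide
```

Combinations that exist for only one period keep their row, with NA in the columns of the missing period, which matches the shape of the expected output in the question.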
Or with previous versions of tidyr:
library(dplyr)
library(tidyr)
df1 %>%
gather(key, val, January:December) %>%
unite(key, key, Period, sep="") %>%
spread(key, val)
Or a data.table option:
library(data.table)
dcast(setDT(df1), Name + KPKV + Year ~ Period, value.var = month.name, sep="")