Is there an R function that can undo cumsum() and recreate the original non-cumulative column in a dataset?
For simplicity, I have created a small dummy dataset.
Note: the dates are in yyyy-mm-dd format.
Here is the dataset DF:
DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))
# A tibble: 12 x 3
   country date       visits
   <chr>   <chr>       <dbl>
 1 France  2020-01-01     10
 2 France  2020-02-01     16
 3 France  2020-03-01     14
 4 France  2020-04-01     12
 5 England 2020-01-01     11
 6 England 2020-02-01      9
 7 England 2020-03-01     12
 8 England 2020-04-01     14
 9 Spain   2020-01-01     13
10 Spain   2020-02-01     13
11 Spain   2020-03-01     15
12 Spain   2020-04-01     10
Here is the dataset DFc:
DFc <- DF %>% group_by(country) %>% mutate(cumulative_visits = cumsum(visits)) %>% select(-visits)
# A tibble: 12 x 3
# Groups:   country [3]
   country date       cumulative_visits
   <chr>   <chr>                  <dbl>
 1 France  2020-01-01                10
 2 France  2020-02-01                26
 3 France  2020-03-01                40
 4 France  2020-04-01                52
 5 England 2020-01-01                11
 6 England 2020-02-01                20
 7 England 2020-03-01                32
 8 England 2020-04-01                46
 9 Spain   2020-01-01                13
10 Spain   2020-02-01                26
11 Spain   2020-03-01                41
12 Spain   2020-04-01                51
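Suppose I only have the dataset DFc. Which R functions can I use to recreate the visits column (as shown in dataset DF) and essentially "undo/reverse" cumsum()?
I have been told that I could incorporate the lag() function, but I am not sure how to do so.
Also, how would the code change if the dates were spaced weeks apart rather than one day?
Any help would be much appreciated :)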
Starting from your toy example:
library(dplyr)
DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))
DF <- DF %>%
  group_by(country) %>%
  mutate(cumulative_visits = cumsum(visits)) %>%
  ungroup()
I suggest two approaches:
- diff
- lag [as you specifically requested]
DF %>%
  group_by(country) %>%
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
  ungroup()
#> # A tibble: 12 x 6
#>    country date       visits cumulative_visits decum_visits1 decum_visits2
#>    <chr>   <chr>       <dbl>             <dbl>         <dbl>         <dbl>
#>  1 France  2020-01-01     10                10            10            10
#>  2 France  2020-02-01     16                26            16            16
#>  3 France  2020-03-01     14                40            14            14
#>  4 France  2020-04-01     12                52            12            12
#>  5 England 2020-01-01     11                11            11            11
#>  6 England 2020-02-01      9                20             9             9
#>  7 England 2020-03-01     12                32            12            12
#>  8 England 2020-04-01     14                46            14            14
#>  9 Spain   2020-01-01     13                13            13            13
#> 10 Spain   2020-02-01     13                26            13            13
#> 11 Spain   2020-03-01     15                41            15            15
#> 12 Spain   2020-04-01     10                51            10            10
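On the question of weekly rather than daily dates: both approaches only look at the previous row within each country, so the spacing between dates should not matter as long as the rows are sorted by date. Here is a minimal base R sketch of that idea. The helper name `decumulate` and the `weekly` data frame are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): inverse of cumsum().
decumulate <- function(x) c(x[1], diff(x))

# Toy weekly data, deliberately shuffled to show that row order matters.
weekly <- data.frame(
  country = rep(c("France", "Spain"), each = 3),
  date = as.Date(rep(c("2020-01-01", "2020-01-08", "2020-01-15"), times = 2)),
  cumulative_visits = c(10, 26, 40, 13, 26, 41)
)
weekly <- weekly[c(3, 1, 2, 6, 5, 4), ]

# Sort by country and date first; the gap between dates is irrelevant.
weekly <- weekly[order(weekly$country, weekly$date), ]

# ave() applies the helper separately within each country.
weekly$visits <- ave(weekly$cumulative_visits, weekly$country, FUN = decumulate)
weekly$visits
#> [1] 10 16 14 13 13 15
```

In the dplyr pipeline above, the equivalent step would be an `arrange(country, date)` before the `mutate()`.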
If a date is missing, say, as in the following example:
DF1 <- DF %>%
  # set to date!
  mutate(date = as.Date(date)) %>%
  # remove one date just for the sake of the example
  filter(date != as.Date("2020-02-01"))
Then I suggest you complete the dates, filling visits with zero and cumulative_visits with the last value seen. You can then get the inverse of cumsum as before:
DF1 %>%
  group_by(country) %>%
  # complete and fill with zero!
  tidyr::complete(date = seq.Date(min(date), max(date), by = "month"), fill = list(visits = 0)) %>%
  # fill cumulative with the last available value
  tidyr::fill(cumulative_visits) %>%
  # reset in the same way
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
  ungroup()
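To see what this gap-filling does to the numbers, the same steps can be traced by hand for a single country. The sketch below uses base R only, with `merge()` and a `cummax()` index standing in for `tidyr::complete()` and `tidyr::fill()`. The names `fr`, `calendar` and `full` are illustrative, and the forward fill assumes the first row is not missing.

```r
# France after dropping 2020-02-01 (values taken from the toy data above).
fr <- data.frame(
  date = as.Date(c("2020-01-01", "2020-03-01", "2020-04-01")),
  cumulative_visits = c(10, 40, 52)
)

# Rebuild the full monthly calendar and left-join the observed rows onto it.
calendar <- data.frame(date = seq(min(fr$date), max(fr$date), by = "month"))
full <- merge(calendar, fr, by = "date", all.x = TRUE)

# Carry the last observed cumulative value forward.
# Assumes the first row is not NA.
idx <- cummax(seq_along(full$cumulative_visits) * !is.na(full$cumulative_visits))
full$cumulative_visits <- full$cumulative_visits[idx]

# Undo the cumulative sum exactly as before.
full$decum_visits <- c(full$cumulative_visits[1], diff(full$cumulative_visits))
full$decum_visits
#> [1] 10  0 30 12
```

Note that March comes out as 30 rather than 14: the 16 visits from the missing February row cannot be separated from March's, so they are attributed to the next observed date. Filling with zero is a modelling choice, not something the data can confirm.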
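Here is a generic solution. It is sloppy because, as you can see, it does not return foo[1], but that can be fixed. (The output of the last line can also be reversed.) I'll leave that "as an exercise for the reader".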
foo <- sample(1:20,10)
[1] 16 11 13 5 6 12 19 10 3 4
bar <- cumsum(foo)
[1] 16 27 40 45 51 63 82 92 95 99
rev(bar[-1])-rev(bar[-length(bar)])
[1] 4 3 10 19 12 6 5 13 11
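One possible way to complete that exercise, as a sketch: prepend the first cumulative value and take successive differences with `diff()`, or equivalently reverse the output of the `rev()` expression above and prepend `bar[1]`. The seed and the names `recovered` and `recovered_rev` are arbitrary choices for this example.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
foo <- sample(1:20, 10)
bar <- cumsum(foo)

# Prepend the first cumulative value, then take successive differences.
recovered <- c(bar[1], diff(bar))

# Same result built from the rev() expression: reverse it back and prepend bar[1].
recovered_rev <- c(bar[1], rev(rev(bar[-1]) - rev(bar[-length(bar)])))

# Both versions reproduce the original vector.
stopifnot(all(recovered == foo), all(recovered_rev == foo))
```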