Reverse cumulative sum by groups in R with multiple group variables
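I have data with three variables: date, age group, and cumulative doses of a drug. There are multiple observations per day (one per age group). I need to keep the number of rows and variables of the original data, but also add a fourth variable giving the actual number of doses for the relevant group on the relevant date.
I tried the solution from another SO post, without success. I get a warning that the mutate function introduced NAs. The code does not error, but the numbers in the new variable are wrong. Some are NA, as the warning says, and some are even negative. I suspect this is because I seem to need two grouping variables and neither of them is numeric, but I'm not sure. I also tried coercing the grouping variables to numeric before applying the solution from the other SO post, and the result had the same problems.
Here is a dummy dataset with characteristics similar to mine: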
structure(list(test_dates = structure(c(17897, 17897, 17897,
17897, 17897, 17898, 17898, 17898, 17898, 17898, 17899, 17899,
17899, 17899, 17899, 17900, 17900, 17900, 17900, 17900, 17901,
17901, 17901, 17901, 17901, 17902, 17902, 17902, 17902, 17902,
17903, 17903, 17903, 17903, 17903, 17904, 17904, 17904, 17904,
17904, 17905, 17905, 17905, 17905, 17905, 17906, 17906, 17906,
17906, 17906), class = "Date"), test_ages = structure(c(1L, 5L,
3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L,
2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L,
4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L
), .Label = c("<18", "18-29", "30-39", "40-49", "50+"), class = c("ordered",
"factor")), cumudose = c(50, 200, 300, 400, 20, 60, 220, 317,
450, 28, 90, 330, 350, 460, 38, 150, 400, 400, 500, 50, 175,
453, 429, 574, 69, 182, 491, 474, 601, 102, 205, 506, 491, 682,
176, 235, 516, 568, 821, 199, 250, 525, 596, 850, 260, 294, 533,
667, 888, 277)), row.names = c(NA, -50L), class = "data.frame")
The first 10 rows of the current data frame look like this:
test_dates | test_ages | cumudose |
---|---|---|
2019-01-01 | <18 | 50 |
2019-01-01 | 50+ | 200 |
2019-01-01 | 30-39 | 300 |
2019-01-01 | 18-29 | 400 |
2019-01-01 | 40-49 | 20 |
2019-01-02 | <18 | 60 |
2019-01-02 | 50+ | 220 |
2019-01-02 | 30-39 | 317 |
2019-01-02 | 18-29 | 450 |
2019-01-02 | 40-49 | 28 |
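After adding the new variable, I would like the data to look like this:
test_dates | test_ages | cumudose | numdose |
---|---|---|---|
2019-01-01 | <18 | 50 | 50 |
2019-01-01 | 50+ | 200 | 200 |
2019-01-01 | 30-39 | 300 | 300 |
2019-01-01 | 18-29 | 400 | 400 |
2019-01-01 | 40-49 | 20 | 20 |
2019-01-02 | <18 | 60 | 10 |
2019-01-02 | 50+ | 220 | 20 |
2019-01-02 | 30-39 | 317 | 17 |
2019-01-02 | 18-29 | 450 | 50 |
2019-01-02 | 40-49 | 28 | 8 |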
Please let me know if I can provide any other information!
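We may need diff, i.e. the difference between consecutive cumulative values within each age group, keeping the first value of each group as it is: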
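library(dplyr)

# df1 is the data frame built from the dput() output in the question.
# Rows are assumed to be sorted by test_dates within each age group.
out <- df1 %>%
  group_by(test_ages) %>%
  # first cumulative value as-is, then the day-over-day differences
  mutate(numdose = c(first(cumudose), diff(cumudose))) %>%
  ungroup()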
Output:
> out
# A tibble: 50 x 4
test_dates test_ages cumudose numdose
<date> <ord> <dbl> <dbl>
1 2019-01-01 <18 50 50
2 2019-01-01 50+ 200 200
3 2019-01-01 30-39 300 300
4 2019-01-01 18-29 400 400
5 2019-01-01 40-49 20 20
6 2019-01-02 <18 60 10
7 2019-01-02 50+ 220 20
8 2019-01-02 30-39 317 17
9 2019-01-02 18-29 450 50
10 2019-01-02 40-49 28 8
# … with 40 more rows
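For readers who prefer to avoid the dplyr dependency, the same idea can be sketched in base R with ave(). This is only an illustration under stated assumptions: `toy` is a hypothetical name for a 10-row subset (the first two days of the question's data, with test_ages as a plain character column), and the rows are assumed to be in date order within each age group. The last step re-accumulates the daily doses as a sanity check.

```r
# Toy subset (hypothetical name): the first two days of the question's data.
toy <- data.frame(
  test_dates = as.Date(rep(c("2019-01-01", "2019-01-02"), each = 5)),
  test_ages  = rep(c("<18", "50+", "30-39", "18-29", "40-49"), times = 2),
  cumudose   = c(50, 200, 300, 400, 20, 60, 220, 317, 450, 28)
)

# ave() applies FUN within each age group and returns the results
# in the original row order.
toy$numdose <- ave(toy$cumudose, toy$test_ages,
                   FUN = function(v) c(v[1], diff(v)))

# Sanity check: re-accumulating the daily doses per group should
# reproduce the cumulative totals.
rebuilt <- ave(toy$numdose, toy$test_ages, FUN = cumsum)
stopifnot(all(rebuilt == toy$cumudose))

toy$numdose
# 50 200 300 400 20 10 20 17 50 8
```

The cumsum() check is a cheap way to confirm that the differencing really inverts the cumulative sum for every group.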
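Or take the difference between the current value and its lag, falling back to the current value for the first row of each group, where the lag is NA: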
df1 %>%
  group_by(test_ages) %>%
  # lag() is NA on the first row of each group; coalesce() then keeps cumudose
  mutate(numdose = coalesce(cumudose - lag(cumudose), cumudose)) %>%
  ungroup()
# A tibble: 50 x 4
test_dates test_ages cumudose numdose
<date> <ord> <dbl> <dbl>
1 2019-01-01 <18 50 50
2 2019-01-01 50+ 200 200
3 2019-01-01 30-39 300 300
4 2019-01-01 18-29 400 400
5 2019-01-01 40-49 20 20
6 2019-01-02 <18 60 10
7 2019-01-02 50+ 220 20
8 2019-01-02 30-39 317 17
9 2019-01-02 18-29 450 50
10 2019-01-02 40-49 28 8
# … with 40 more rows
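The question reports negative values and wonders whether two grouping variables are needed. The asker's code is not shown, so the following is only one plausible explanation, sketched on a hypothetical 10-row subset (`toy`, the first two days of the question's data) with a hypothetical helper `diff_from_lag`. It compares three grouping choices: no grouping, grouping by age group only, and grouping by both date and age group.

```r
library(dplyr)

# Toy subset (hypothetical name): the first two days of the question's data.
toy <- data.frame(
  test_dates = as.Date(rep(c("2019-01-01", "2019-01-02"), each = 5)),
  test_ages  = rep(c("<18", "50+", "30-39", "18-29", "40-49"), times = 2),
  cumudose   = c(50, 200, 300, 400, 20, 60, 220, 317, 450, 28)
)

# Hypothetical helper: current value minus previous value,
# keeping the current value where there is no previous one.
diff_from_lag <- function(x) coalesce(x - dplyr::lag(x), x)

no_group <- toy %>%
  mutate(numdose = diff_from_lag(cumudose))

by_age <- toy %>%
  group_by(test_ages) %>%
  mutate(numdose = diff_from_lag(cumudose)) %>%
  ungroup()

by_both <- toy %>%
  group_by(test_dates, test_ages) %>%
  mutate(numdose = diff_from_lag(cumudose)) %>%
  ungroup()

# Without grouping, differences are taken across age groups,
# which produces negative values.
any(no_group$numdose < 0)                  # TRUE

# Grouping by date AND age group leaves one row per group, so there
# is nothing to subtract and numdose just repeats cumudose.
all(by_both$numdose == by_both$cumudose)   # TRUE

# Grouping by age group only gives the desired daily doses.
by_age$numdose
# 50 200 300 400 20 10 20 17 50 8
```

In this layout the date defines the order within each group rather than the group itself, which is why only test_ages appears in group_by().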