Within group_by, mutate a new column that grabs values of a column based on their order of appearance
I am wrangling a dataset with a crossover trial design. Here is a toy example with a similar structure:
df <- structure(list(subject = c("a", "a", "a", "a", "a", "a", "b",
"b", "b", "b", "c", "c", "c", "c", "c", "c"), treatment = c("none",
"placebo", "placebo", "drug", "drug", "drug", "none", "drug",
"placebo", "placebo", "none", "placebo", "drug", "drug", "drug",
"drug"), day = c(0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 0, 1, 2, 3, 4,
5)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))
# A tibble: 16 × 3
subject treatment day
<chr> <chr> <dbl>
1 a none 0
2 a placebo 1
3 a placebo 2
4 a drug 3
5 a drug 4
6 a drug 5
7 b none 0
8 b drug 1
9 b placebo 2
10 b placebo 3
11 c none 0
12 c placebo 1
13 c drug 2
14 c drug 3
15 c drug 4
16 c drug 5
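So each subject starts with a "none" value in treatment, then gets a few days of either placebo or drug, followed by a few days of the other.
What I want is a new stage column that tells me the chronological stage of the experiment, based on the order of the treatment values. In other words, the starting none value within a subject will always be the "first" stage, the next treatment value to appear chronologically within that subject will be the "second" stage, and the last value to appear will be the "third" stage.
So my desired output looks like this: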
# A tibble: 16 × 4
subject treatment day stage
<chr> <chr> <dbl> <chr>
1 a none 0 first
2 a placebo 1 second
3 a placebo 2 second
4 a drug 3 third
5 a drug 4 third
6 a drug 5 third
7 b none 0 first
8 b drug 1 second
9 b placebo 2 third
10 b placebo 3 third
11 c none 0 first
12 c placebo 1 second
13 c drug 2 third
14 c drug 3 third
15 c drug 4 third
16 c drug 5 third
What made sense to me was to combine group_by and mutate with a factor of treatment, but it doesn't work:
# my failed attempt
library(dplyr)

df %>%
  arrange(subject, day) %>%  # needed for my actual dataset
  group_by(subject) %>%
  mutate(stage = factor(treatment, labels = c("first", "second", "third"))) %>%
  ungroup()
which gives:
# A tibble: 16 × 4
subject treatment day stage
<chr> <chr> <dbl> <fct>
1 a none 0 second
2 a placebo 1 third
3 a placebo 2 third
4 a drug 3 first
5 a drug 4 first
6 a drug 5 first
7 b none 0 second
8 b drug 1 first
9 b placebo 2 third
10 b placebo 3 third
11 c none 0 second
12 c placebo 1 third
13 c drug 2 first
14 c drug 3 first
15 c drug 4 first
16 c drug 5 first
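The problem is that the labels are assigned according to the alphabetical order of the treatment values, but I want them assigned according to the order in which the treatment values appear within each subject. I also tried using levels instead of labels, and I just get all NAs.
Any help is appreciated. A dplyr solution is preferred, but I'm happy to go with any other solution.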
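For context, the behaviour described in the question can be reproduced in base R. This is a minimal sketch, assuming a single hypothetical subject's treatment vector (not the full df): labels renames the default levels, which are the sorted unique values, while levels declares which values are allowed at all. The last variant only works when the subject has exactly three distinct treatments.

```r
# Hypothetical stand-in for one subject's treatment column (an assumption, not the full df)
treatment <- c("none", "placebo", "placebo", "drug")

# labels= renames the default levels, which are the sorted unique values:
# drug, none, placebo -> first, second, third
by_labels <- factor(treatment, labels = c("first", "second", "third"))
as.character(by_labels)  # "second" "third" "third" "first"

# levels= declares the allowed values; no treatment equals "first"/"second"/"third",
# so every element becomes NA
by_levels <- factor(treatment, levels = c("first", "second", "third"))
all(is.na(by_levels))  # TRUE

# Setting the levels in order of appearance and then relabelling gives the intended mapping
by_appearance <- factor(treatment, levels = unique(treatment),
                        labels = c("first", "second", "third"))
as.character(by_appearance)  # "first" "second" "second" "third"
```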
You can group_by subject and then use match or rleid. Use english::ordinal to get the expected output.
df %>%
group_by(subject) %>%
mutate(match = match(treatment, unique(treatment)),
rleid = data.table::rleid(treatment),
stage = english::ordinal(match))
# A tibble: 16 × 6
# Groups: subject [3]
subject treatment day match rleid stage
<chr> <chr> <dbl> <int> <int> <ordinal>
1 a none 0 1 1 first
2 a placebo 1 2 2 second
3 a placebo 2 2 2 second
4 a drug 3 3 3 third
5 a drug 4 3 3 third
6 a drug 5 3 3 third
7 b none 0 1 1 first
8 b drug 1 2 2 second
9 b placebo 2 3 3 third
10 b placebo 3 3 3 third
11 c none 0 1 1 first
12 c placebo 1 2 2 second
13 c drug 2 3 3 third
14 c drug 3 3 3 third
15 c drug 4 3 3 third
16 c drug 5 3 3 third
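If there is any case where placebo is given again after the drug, creating a "fourth" stage, then relying on unique(treatment) could lead to errors.
Alternatively, you could compute the cumulative sum of treatment changes: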
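To illustrate that caveat, here is a small base R sketch. The treatment vector is a hypothetical subject who returns to placebo after the drug (no such subject exists in the toy data), and the run-length id is built from rle() as a stand-in for data.table::rleid.

```r
# Hypothetical subject whose placebo recurs after drug (not in the original data)
treatment <- c("none", "placebo", "drug", "drug", "placebo")

# match() indexes by first appearance, so the returning placebo is stage 2 again
by_match <- match(treatment, unique(treatment))

# A run-length id counts every change, so the returning placebo becomes stage 4
runs <- rle(treatment)
by_run <- rep(seq_along(runs$lengths), runs$lengths)

by_match  # 1 2 3 3 2
by_run    # 1 2 3 3 4
```

Which of the two is right depends on whether a repeated treatment should count as a new stage.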
library(tidyverse)
df %>%
group_by(subject) %>%
mutate(stage_change = treatment!=lag(treatment),
stage = cumsum(ifelse(is.na(stage_change), 1, stage_change))) %>%
select(-stage_change)
#> # A tibble: 16 x 4
#> # Groups: subject [3]
#> subject treatment day stage
#> <chr> <chr> <dbl> <dbl>
#> 1 a none 0 1
#> 2 a placebo 1 2
#> 3 a placebo 2 2
#> 4 a drug 3 3
#> 5 a drug 4 3
#> 6 a drug 5 3
#> 7 b none 0 1
#> 8 b drug 1 2
#> 9 b placebo 2 3
#> 10 b placebo 3 3
#> 11 c none 0 1
#> 12 c placebo 1 2
#> 13 c drug 2 3
#> 14 c drug 3 3
#> 15 c drug 4 3
#> 16 c drug 5 3
Created on 2022-05-03 by the reprex package (v2.0.1)
If needed, you can use english::ordinal(stage).
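If adding the english package is not an option, a plain lookup vector can do the labelling. This is a sketch in base R, assuming at most four stages; ordinal_words and the sample treatment vector (subject b from the toy data) are illustrative names and values, not part of the answer above.

```r
# Treatment sequence of subject b from the toy data
treatment <- c("none", "drug", "placebo", "placebo")

# TRUE on the first row and wherever the treatment differs from the previous row
changed <- c(TRUE, treatment[-1] != treatment[-length(treatment)])
stage <- cumsum(changed)  # 1 2 3 3

# Lookup vector of labels (assumed names); extend it if more stages can occur
ordinal_words <- c("first", "second", "third", "fourth")
stage_label <- ordinal_words[stage]
stage_label  # "first" "second" "third" "third"
```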