Summarizing within Conditions Repeated over Time

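I am trying to summarize data within time intervals, using a data set in which conditions repeat over time at varying intervals. I want the mean and standard deviation of each condition within each of its time intervals.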

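However, in my real data I don't know how many intervals each condition will have. I thought I might be able to mark the end of an interval by a change in condition from one row to the next, but I don't know how to code that.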

library(tidyverse)

df <- data.frame(Condition = c(rep("A", 50), 
                               rep("B", 60), 
                               rep("C", 50),
                               rep("A", 60), 
                               rep("B", 50), 
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50), 
                          seq(190.05, 230, length.out = 60), 
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60), 
                          seq(293.05, 321, length.out = 50), 
                          seq(321.05, 352, length.out = 50))
) %>%
        rowwise() %>%
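        # Note: rnorm(1.4, 0.3) draws one value with mean 0.3 and sd 1;
        # rnorm(1, 1.4, 0.3) would give mean 1.4 and sd 0.3
        mutate(X = rnorm(1.4, 0.3))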

I am trying to calculate mean(X) and sd(X) for each condition interval (the numbers below are made up):

Condition   interval        mean(X)   sd(X)
A            [160,190]       1.4      0.32
B            [190.05,230]    1.46     0.36
C            [230.05,260]    1.32     0.26
A            [260.05,293]    1.5      0.40
B            [293.05,321]    1.25     0.34
C            [321.05,352]    1.43     0.41

I have tried this, but it doesn't do what I need:

df %>%  
        group_by(Condition) %>%
        mutate(interval = cut(Time,
                              breaks = c(floor(min(Time)), ceiling(max(Time))),
                              include.lowest = F, 
                              right = F)) %>%
        group_by(Condition, interval) %>% 
        summarise( mean.X = mean(X),
                   sd.X = sd(X))

This does not give me the second interval for each condition:

  Condition interval  mean.X   sd.X
  <chr>     <fct>      <dbl>  <dbl>
1 A         [160,293)  0.231  0.991
2 A         NA         1.61  NA    
3 B         [190,321)  0.421  0.893
4 B         NA         0.249 NA    
5 C         [230,352)  0.193  0.898
6 C         NA         0.427 NA   

Any suggestions?

I definitely think there should be a less messy way to do this, but here is a possible solution using kmeans():

library(tidyverse)

set.seed(100)
df <- data.frame(Condition = c(rep("A", 50), 
                               rep("B", 60), 
                               rep("C", 50),
                               rep("A", 60), 
                               rep("B", 50), 
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50), 
                          seq(190.05, 230, length.out = 60), 
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60), 
                          seq(293.05, 321, length.out = 50), 
                          seq(321.05, 352, length.out = 50))
) %>%
  rowwise() %>%
  mutate(X = rnorm(1.4, 0.3))

df %>% 
  group_by(Condition) %>% 
  mutate(Block = kmeans(Time, 2)$cluster) %>% 
  group_by(Condition, Block) %>% 
  mutate(interval = as.character(cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = T, 
                        right = T))) %>%
  group_by(Condition, interval) %>% 
  summarise(mean.X = mean(X),
            sd.X = sd(X)) %>% 
  arrange(Condition, interval)
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups:   Condition [3]
#>   Condition interval  mean.X  sd.X
#>   <chr>     <chr>      <dbl> <dbl>
#> 1 A         [160,190]  0.382 0.819
#> 2 A         [260,293]  0.277 0.940
#> 3 B         [190,230]  0.229 1.14 
#> 4 B         [293,321]  0.303 1.08 
#> 5 C         [230,260]  0.265 0.755
#> 6 C         [321,352]  0.301 0.900

How you handle the NAs is up to you.

Edit 1:

Added @Sinh Nguyen's cut improvement.

Edit 2: in response to the updated question:

We can use the rleid() function from data.table:

library(tidyverse)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose

set.seed(100)
df <- data.frame(Condition = c(rep("A", 50), 
                               rep("B", 60), 
                               rep("C", 50),
                               rep("A", 60), 
                               rep("B", 50), 
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50), 
                          seq(190.05, 230, length.out = 60), 
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60), 
                          seq(293.05, 321, length.out = 50), 
                          seq(321.05, 352, length.out = 50))
) %>%
  rowwise() %>%
  mutate(X = rnorm(1.4, 0.3))

Block <- rleid(df$Condition)
df %>% 
  add_column(Block) %>% 
  group_by(Condition, Block) %>% 
  mutate(interval = paste0("[", min(Time), ",", max(Time), "]")) %>%
  group_by(Condition, interval) %>% 
  summarise(mean.X = mean(X), sd.X = sd(X))
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups:   Condition [3]
#>   Condition interval     mean.X  sd.X
#>   <chr>     <chr>         <dbl> <dbl>
#> 1 A         [160,190]     0.382 0.819
#> 2 A         [260.05,293]  0.277 0.940
#> 3 B         [190.05,230]  0.229 1.14 
#> 4 B         [293.05,321]  0.303 1.08 
#> 5 C         [230.05,260]  0.265 0.755
#> 6 C         [321.05,352]  0.301 0.900
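
To make the role of rleid() above easier to see, here is a minimal sketch on a toy vector. The vector `cond` is hypothetical and is not the data from the question. It only illustrates that rleid() starts a new id whenever the value changes from one element to the next, which is why a repeated condition gets a separate Block.

```r
library(data.table)

# Hypothetical toy vector, not the data from the question
cond <- c("A", "A", "B", "B", "B", "A", "C", "C")

# rleid() assigns a new id each time the value differs from the previous element
block <- rleid(cond)
print(block)
# 1 1 2 2 2 3 4 4
```

Because the ids follow row order, the data should be sorted by Time before rleid() is applied.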

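The second interval group has NA values because of the arguments you passed to cut(): with right = F (and include.lowest = F) the interval is open on the right, so the record with Time == max(Time) falls outside the interval and gets NA.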

df %>%  
  group_by(Condition) %>%
  mutate(interval = cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = F, right = F)) %>%
  filter(is.na(interval))
#> # A tibble: 3 x 4
#> # Groups:   Condition [3]
#>   Condition  Time      X interval
#>   <chr>     <dbl>  <dbl> <fct>   
#> 1 A           293 -1.52  <NA>    
#> 2 B           321  1.35  <NA>    
#> 3 C           352  0.758 <NA>

As you can see above, each group has one record with an NA interval. If you change the cut arguments to right = T and include.lowest = T, all of those records will be included.

df %>%  
  group_by(Condition) %>%
  mutate(interval = cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = T, right = T)) %>%
  group_by(Condition, interval) %>% 
  summarise( mean.X = mean(X),
             sd.X = sd(X))

#> # A tibble: 3 x 4
#> # Groups:   Condition [3]
#>   Condition interval  mean.X  sd.X
#>   <chr>     <fct>      <dbl> <dbl>
#> 1 A         [160,293]  0.230 0.963
#> 2 B         [190,321]  0.124 0.961
#> 3 C         [230,352]  0.146 0.961

If this is not what you expected, please clarify further how you want the intervals to be defined.
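
The boundary behaviour of cut() described in this answer can be checked on a small vector. The values in `time` and `brks` below are hypothetical and chosen so that the last value equals the upper break, which is the situation that produced the NA rows.

```r
# Hypothetical toy values: the last one equals the upper break
time <- c(160, 200, 293)
brks <- c(160, 293)

# right = FALSE gives [160,293): the value 293 falls outside and becomes NA
open_right <- cut(time, breaks = brks, include.lowest = FALSE, right = FALSE)

# right = TRUE with include.lowest = TRUE gives [160,293]: both ends are kept
closed <- cut(time, breaks = brks, include.lowest = TRUE, right = TRUE)

print(is.na(open_right))  # FALSE FALSE TRUE
print(is.na(closed))      # FALSE FALSE FALSE
```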

Created on 2022-05-16 by the reprex package (v2.0.1)

We can use rle to define the "groups" of your conditions.

library(dplyr)

df %>% 
  ungroup() %>% 
  mutate(group = rep(1:length(rle(Condition)$lengths), rle(Condition)$lengths)) %>% 
  group_by(group) %>% 
  summarize(Condition = unique(Condition),
            interval = paste0("[", range(Time)[1], ",", range(Time)[2], "]"), 
            mean_X = mean(X), 
            sd_X = sd(X))

# A tibble: 6 × 5
  group Condition interval     mean_X  sd_X
  <int> <chr>     <chr>         <dbl> <dbl>
1     1 A         [160,190]    0.160  0.926
2     2 B         [190.05,230] 0.0258 0.990
3     3 C         [230.05,260] 0.296  1.03 
4     4 A         [260.05,293] 0.472  1.08 
5     5 B         [293.05,321] 0.0363 1.08 
6     6 C         [321.05,352] 0.361  1.10
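
For comparison, the same run-based grouping can be sketched in base R without any packages, by flagging the rows where Condition differs from the previous row and taking a cumulative sum. This is only an illustration under assumptions: the data frame `toy` and the names `changed`, `run` and `result` are hypothetical, and the rows are assumed to be sorted by Time.

```r
# Hypothetical toy data, not the data from the question
toy <- data.frame(
  Condition = c("A", "A", "B", "B", "A", "A"),
  Time      = c(1, 2, 3, 4, 5, 6),
  X         = c(10, 12, 20, 24, 30, 34)
)

# A new run starts at row 1 and wherever Condition differs from the previous row
n <- nrow(toy)
changed <- c(TRUE, toy$Condition[-1] != toy$Condition[-n])
toy$run <- cumsum(changed)

# One summary row per run, in chronological order
runs <- split(toy, toy$run)
result <- do.call(rbind, lapply(runs, function(d) {
  data.frame(
    Condition = d$Condition[1],
    interval  = paste0("[", min(d$Time), ",", max(d$Time), "]"),
    mean_X    = mean(d$X),
    sd_X      = sd(d$X)
  )
}))
print(result)
```

Grouping on the run id alone, rather than on Condition, is what keeps the two "A" intervals separate and leaves the rows in time order.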