Summarizing within Conditions Repeated over Time
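I am trying to summarize data within time intervals, using a dataset in which conditions repeat over time at intervals of varying length. I want the mean and standard deviation for each condition within each time interval.
However, in my real data I do not know how many intervals each condition will have. I thought I could perhaps mark the end of an interval by a change in condition from one row to the next, but I don't know how to code that.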
library(tidyverse)
df <- data.frame(Condition = c(rep("A", 50),
                               rep("B", 60),
                               rep("C", 50),
                               rep("A", 60),
                               rep("B", 50),
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50),
                          seq(190.05, 230, length.out = 60),
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60),
                          seq(293.05, 321, length.out = 50),
                          seq(321.05, 352, length.out = 50))
) %>%
  rowwise() %>%
  # Note: the first argument of rnorm() is n, so this call draws one value per
  # row with mean 0.3 and sd 1. rnorm(1, 1.4, 0.3) would give mean 1.4, sd 0.3.
  mutate(X = rnorm(1.4, 0.3))
I am trying to compute mean(X) and sd(X) for each condition interval (the numbers are made up):
Condition  interval      mean(X)  sd(X)
A          [160,190]     1.4      0.32
B          [190.05,230]  1.46     0.36
C          [230.05,260]  1.32     0.26
A          [260.05,293]  1.5      0.40
B          [293.05,321]  1.25     0.34
C          [321.05,352]  1.43     0.41
I have tried this, but it does not do what I need:
df %>%
  group_by(Condition) %>%
  mutate(interval = cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = FALSE,
                        right = FALSE)) %>%
  group_by(Condition, interval) %>%
  summarise(mean.X = mean(X),
            sd.X = sd(X))
This does not give me the second interval for each condition:
Condition interval mean.X sd.X
<chr> <fct> <dbl> <dbl>
1 A [160,293) 0.231 0.991
2 A NA 1.61 NA
3 B [190,321) 0.421 0.893
4 B NA 0.249 NA
5 C [230,352) 0.193 0.898
6 C NA 0.427 NA
Any suggestions?
I definitely think there should be a less messy way, but kmeans() gives the following possible solution:
library(tidyverse)
set.seed(100)
df <- data.frame(Condition = c(rep("A", 50),
                               rep("B", 60),
                               rep("C", 50),
                               rep("A", 60),
                               rep("B", 50),
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50),
                          seq(190.05, 230, length.out = 60),
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60),
                          seq(293.05, 321, length.out = 50),
                          seq(321.05, 352, length.out = 50))
) %>%
  rowwise() %>%
  # Kept as in the question: n = 1.4 is treated as 1, mean = 0.3, sd = 1
  mutate(X = rnorm(1.4, 0.3))

df %>%
  group_by(Condition) %>%
  # Split each condition's times into 2 clusters (assumes two blocks per condition)
  mutate(Block = kmeans(Time, 2)$cluster) %>%
  group_by(Condition, Block) %>%
  mutate(interval = as.character(cut(Time,
                                     breaks = c(floor(min(Time)), ceiling(max(Time))),
                                     include.lowest = TRUE,
                                     right = TRUE))) %>%
  group_by(Condition, interval) %>%
  summarise(mean.X = mean(X),
            sd.X = sd(X)) %>%
  arrange(Condition, interval)
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups: Condition [3]
#> Condition interval mean.X sd.X
#> <chr> <chr> <dbl> <dbl>
#> 1 A [160,190] 0.382 0.819
#> 2 A [260,293] 0.277 0.940
#> 3 B [190,230] 0.229 1.14
#> 4 B [293,321] 0.303 1.08
#> 5 C [230,260] 0.265 0.755
#> 6 C [321,352] 0.301 0.900
How you handle the NA values is up to you.
Edit 1: Added @Sinh Nguyen's cut improvement.
Edit 2, in response to the updated question: we can use the rleid() function from data.table.
library(tidyverse)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
set.seed(100)
df <- data.frame(Condition = c(rep("A", 50),
                               rep("B", 60),
                               rep("C", 50),
                               rep("A", 60),
                               rep("B", 50),
                               rep("C", 50)),
                 Time = c(seq(160, 190, length.out = 50),
                          seq(190.05, 230, length.out = 60),
                          seq(230.05, 260, length.out = 50),
                          seq(260.05, 293, length.out = 60),
                          seq(293.05, 321, length.out = 50),
                          seq(321.05, 352, length.out = 50))
) %>%
  rowwise() %>%
  # Kept as in the question: n = 1.4 is treated as 1, mean = 0.3, sd = 1
  mutate(X = rnorm(1.4, 0.3))

# rleid() gives each run of identical consecutive Condition values its own id
Block <- rleid(df$Condition)
df %>%
  add_column(Block) %>%
  group_by(Condition, Block) %>%
  # Label each block with its observed time range
  mutate(interval = paste0("[", min(Time), ",", max(Time), "]")) %>%
  group_by(Condition, interval) %>%
  summarise(mean.X = mean(X), sd.X = sd(X))
#> `summarise()` has grouped output by 'Condition'. You can override using the
#> `.groups` argument.
#> # A tibble: 6 × 4
#> # Groups: Condition [3]
#> Condition interval mean.X sd.X
#> <chr> <chr> <dbl> <dbl>
#> 1 A [160,190] 0.382 0.819
#> 2 A [260.05,293] 0.277 0.940
#> 3 B [190.05,230] 0.229 1.14
#> 4 B [293.05,321] 0.303 1.08
#> 5 C [230.05,260] 0.265 0.755
#> 6 C [321.05,352] 0.301 0.900
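If adding data.table is not an option, the same kind of run identifier can be sketched in base R: flag every row whose Condition differs from the previous row, then take the cumulative sum of those flags. The toy data frame and its values are made up for illustration; only the column names follow the question.

```r
# Toy data (hypothetical values): condition A, then B, then A again
toy <- data.frame(
  Condition = c("A", "A", "B", "B", "B", "A"),
  Time      = c(1, 2, 3, 4, 5, 6),
  X         = c(1, 3, 2, 4, 6, 5)
)

# TRUE on the first row and on every row where Condition changes
changed <- c(TRUE, toy$Condition[-1] != toy$Condition[-nrow(toy)])

# Running count of changes = block id per row
toy$Block <- cumsum(changed)
toy$Block
# 1 1 2 2 2 3

# Summarise each block separately, keeping the order in which blocks occur
result <- do.call(rbind, lapply(split(toy, toy$Block), function(b) {
  data.frame(
    Condition = b$Condition[1],
    interval  = paste0("[", min(b$Time), ",", max(b$Time), "]"),
    mean.X    = mean(b$X),
    sd.X      = sd(b$X)   # NA for a block with a single row
  )
}))
print(result)
```

Because the block id depends only on row order, this needs no advance knowledge of how many intervals each condition has, but it does assume the rows are already sorted by Time.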
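The second interval group has NA values because of the arguments you pass to cut: with right = F the interval is open on the right, so the record with Time == max(Time) falls outside the interval and gets NA.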
df %>%
  group_by(Condition) %>%
  mutate(interval = cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = F, right = F)) %>%
  filter(is.na(interval))
#> # A tibble: 3 x 4
#> # Groups: Condition [3]
#> Condition Time X interval
#> <chr> <dbl> <dbl> <fct>
#> 1 A 293 -1.52 <NA>
#> 2 B 321 1.35 <NA>
#> 3 C 352 0.758 <NA>
As you can see above, each group has one record with an NA interval.
If you change the cut arguments to right = T and include.lowest = T, all of those records are included.
df %>%
  group_by(Condition) %>%
  mutate(interval = cut(Time,
                        breaks = c(floor(min(Time)), ceiling(max(Time))),
                        include.lowest = T, right = T)) %>%
  group_by(Condition, interval) %>%
  summarise(mean.X = mean(X),
            sd.X = sd(X))
#> # A tibble: 3 x 4
#> # Groups: Condition [3]
#> Condition interval mean.X sd.X
#> <chr> <fct> <dbl> <dbl>
#> 1 A [160,293] 0.230 0.963
#> 2 B [190,321] 0.124 0.961
#> 3 C [230,352] 0.146 0.961
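The effect of the two arguments is easiest to see on a tiny vector. The sketch below uses three made-up values (the lower break, an interior value, and the upper break) and a single interval, which mirrors the per-condition call above.

```r
times  <- c(160, 200, 293)  # lower break, interior value, upper break
breaks <- c(160, 293)

# Left-closed, right-open interval: the value equal to the upper break gets NA
open_right <- cut(times, breaks = breaks, right = FALSE, include.lowest = FALSE)
as.character(open_right)
# "[160,293)" "[160,293)" NA

# Right-closed interval with include.lowest = TRUE: both end points are kept
closed <- cut(times, breaks = breaks, right = TRUE, include.lowest = TRUE)
as.character(closed)
# "[160,293]" "[160,293]" "[160,293]"
```

Note that this only removes the NA rows; with a single pair of breaks per condition, both blocks of a condition still end up in the same interval.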
If this is not what you expected, please clarify how you want the intervals to be defined.
Created on 2022-05-16 by the reprex package (v2.0.1)
We can use rle to define "groups" of your conditions.
library(dplyr)
df %>%
  ungroup() %>%
  # One group id per run of identical consecutive Condition values
  mutate(group = rep(seq_along(rle(Condition)$lengths), rle(Condition)$lengths)) %>%
  group_by(group) %>%
  summarize(Condition = unique(Condition),
            interval = paste0("[", range(Time)[1], ",", range(Time)[2], "]"),
            mean_X = mean(X),
            sd_X = sd(X))
# A tibble: 6 × 5
group Condition interval mean_X sd_X
<int> <chr> <chr> <dbl> <dbl>
1 1 A [160,190] 0.160 0.926
2 2 B [190.05,230] 0.0258 0.990
3 3 C [230.05,260] 0.296 1.03
4 4 A [260.05,293] 0.472 1.08
5 5 B [293.05,321] 0.0363 1.08
6 6 C [321.05,352] 0.361 1.10
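To see why the rep() expression yields one id per run, it may help to look at what rle() returns for a short vector. The vector below is made up for illustration.

```r
cond <- c("A", "A", "B", "B", "B", "A")

# rle() compresses the vector into runs: how long each run is, and its value
runs <- rle(cond)
runs$lengths
# 2 3 1
runs$values
# "A" "B" "A"

# Repeat the run index 1, 2, 3 as many times as each run is long
group <- rep(seq_along(runs$lengths), times = runs$lengths)
group
# 1 1 2 2 2 3
```

The second run of "A" gets its own id (3) rather than being merged with the first, which is what keeps repeated conditions apart in the summary.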