使用 data.table() 和 dplyr() 在 R 中按月逐年进行子集化和汇总(总和、平均值和标准差)多级数据
Subsetting and Summarising (sum, mean, and standard deviation) Multi-level data by month over years using data.table() and dplyr() in R
问题
我有一个名为 'FID' 的数据框(见下文),其中包含三年内每月 FID 的频率。
我想通过使用包 data.table 和 dplyr 计算 3 年内 FID 的总频率 来对我的数据框进行子集化, 然后我想总结我的数据,所以它包含:-
汇总数据框
- 月份
- 3 年来每月 FID 的总频率
- 3 年来每月 FID 的平均频率
- 3 年内每月 FID 的标准差
- 3 年内每月 FID 的标准误差
- 3 年内每月的上下置信水平
我不知道如何通过添加 3 年内每月 FID 的总频率来对数据框进行子集化。例如,在下面的这个例子中,三年一月份的总频率是 - 86 + 66 + 56 = 208,我想每个月都这样做。
###What a section of my data frame looks like ##Desired outcome
Year Month FID Month FID
2018 January 86 January 208
2019 January 66 February 176
2020 January 56
2018 February 76
2019 February 55
2020 February 45
汇总数据框的所需列
Month Total_FID Mean_FID SD_FID S.E_FID Ci_Lower Ci_Upper
我不是高级 R 用户,我已经在线阅读了许多 Stack Overflow 页面和教程,但我无法找出正确的程序(请参阅下面的 R 代码)来生成我想要的汇总数据框。我也找不到使用包 data.table 生成上下置信区间的方法。知道如何使用 data.table 和 dplyr 执行此过程真的很方便,因为我最常使用这两个包。
如果有人能提供帮助,我将不胜感激。
非常感谢
R-code
使用data.table
##Reformat into a data.table object
FID_Table<-data.table(FID)
##Summary statistics
FID.Summarised=FID_Table[, sum(FID),
Month=.N,
Mean_FID=mean(FID),
SD_FID=sd(FID),
S.E = std.error(FID),
by=Month]
##Error message
Error in `[.data.table`(FID_Table, , sum(FID), Month = .N, Mean_FID = mean(FID), :
unused arguments (Month = .N, Mean_FID = mean(FID), SD_FID = sd(FID), S.E = std.error(FID))
使用 dplyr
###Create a vector to ensure the dates are in the right order
library(tidyverse) #loads dplyr and ggplot2
month_levels = c('January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December')
p<-FID.Counts %>%
dplyr::mutate(Month=ordered(Month, levels=month_levels)) %>%
group_by(Month) %>%
dplyr::summarise(Mean.Month=mean(FID),
sd.month = sd(FID, na.rm = TRUE),
n_FID=n(),
sem=sd(FID)/sqrt(n()),
ci_low=Mean.Month-1.96*sem,
ci_hi=Mean.Month+1.96*sem) %>%
ungroup()
##The frequency of FID per month over three years is missing (see FID.Count below)
This code produces the summary statistics but does not subset the
data frame by the frequency counts of FID per month over the last 3
years.
#This table below summed the FID frequencies for each month using
this code using data.table.
library(data.table)
FID.Count=FID_Table[, sum(FID), by=(Month)]
Month V1
1: January 165
2: February 182
3: March 179
4: April 104
5: May 124
6: June 10
7: July 15
8: August 133
9: September 97
10: October 82
11: November 75
12: December 102
R 代码使用 dplyr
生成的汇总统计信息
Month Mean.Month sd.month n_FID sem ci_low ci_hi
1 January 55.000000 10.535654 3 6.082763 43.0777854 66.922215
2 February 60.666667 29.737743 3 17.169094 27.0152431 94.318090
3 March 59.666667 33.291641 3 19.220938 21.9936289 97.339704
4 April 34.666667 16.862186 3 9.735388 15.5853064 53.748027
5 May 41.333333 49.571497 3 28.620117 -14.7620965 97.428763
6 June 3.333333 5.773503 3 3.333333 -3.2000000 9.866667
7 July 5.000000 4.358899 3 2.516611 0.0674415 9.932558
8 August 44.333333 21.007935 3 12.128937 20.5606169 68.106050
9 September 32.333333 21.548395 3 12.440972 7.9490287 56.717638
10 October 27.333333 13.051181 3 7.535103 12.5645314 42.102135
11 November 25.000000 19.000000 3 10.969655 3.4994760 46.500524
12 December 34.000000 4.582576 3 2.645751 28.8143274 39.185673
数据框:'FID'
structure(list(Year = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L), Month = structure(c(5L, 4L, 8L, 1L, 9L,
7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L,
12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L,
10L, 3L), .Label = c("April", "August", "December", "February",
"January", "July", "June", "March", "May", "November", "October",
"September"), class = "factor"), FID = c(65L, 88L, 43L, 54L,
98L, 0L, 0L, 23L, 10L, 15L, 6L, 33L, 56L, 29L, 98L, 23L, 6L,
10L, 7L, 65L, 53L, 41L, 25L, 30L, 44L, 65L, 38L, 27L, 20L, 0L,
8L, 45L, 34L, 26L, 44L, 39L)), class = "data.frame", row.names = c(NA,
-36L))
答案:data.table
##Reformat into a data.table object
FID_Table<-data.table(FID)
##Summarise Data
Summarised.FID<-FID_Table[, .(FID.Freq=sum(FID),
mean = mean(FID),
sd=sd(FID),
median=median(FID)),
by = .(Month)]
##data.table results
Month FID.Freq mean sd median
1: January 165 55.000000 10.535654 56
2: February 182 60.666667 29.737743 65
3: March 179 59.666667 33.291641 43
4: April 104 34.666667 16.862186 27
5: May 124 41.333333 49.571497 20
6: June 10 3.333333 5.773503 0
7: July 15 5.000000 4.358899 7
8: August 133 44.333333 21.007935 45
9: September 97 32.333333 21.548395 34
10: October 82 27.333333 13.051181 26
11: November 75 25.000000 19.000000 25
12: December 102 34.000000 4.582576 33
答案:dplyr
#####Summarise Data###########
###Create a vector to ensure the dates are in the right order
library(tidyverse) #loads dplyr and ggplot2
month_levels = c('January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December')
##Summarise data
p<-FID.Counts %>%
dplyr::mutate(Month=ordered(Month, levels=month_levels)) %>%
group_by(Month) %>%
dplyr::summarise(Frequency = sum(FID),
Mean.Month=mean(FID),
sd.month = sd(FID, na.rm = TRUE),
n_FID=n(),
sem=sd(FID)/sqrt(n()),
ci_low=Mean.Month-1.96*sem,
ci_hi=Mean.Month+1.96*sem) %>%
ungroup()
##dplyr Results
Month Frequency Mean.Month sd.month n_FID sem ci_low ci_hi
1 January 165 55.000000 10.535654 3 6.082763 43.0777854 66.922215
2 February 182 60.666667 29.737743 3 17.169094 27.0152431 94.318090
3 March 179 59.666667 33.291641 3 19.220938 21.9936289 97.339704
4 April 104 34.666667 16.862186 3 9.735388 15.5853064 53.748027
5 May 124 41.333333 49.571497 3 28.620117 -14.7620965 97.428763
6 June 10 3.333333 5.773503 3 3.333333 -3.2000000 9.866667
7 July 15 5.000000 4.358899 3 2.516611 0.0674415 9.932558
8 August 133 44.333333 21.007935 3 12.128937 20.5606169 68.106050
9 September 97 32.333333 21.548395 3 12.440972 7.9490287 56.717638
10 October 82 27.333333 13.051181 3 7.535103 12.5645314 42.102135
11 November 75 25.000000 19.000000 3 10.969655 3.4994760 46.500524
12 December 102 34.000000 4.582576 3 2.645751 28.8143274 39.185673
问题
我有一个名为 'FID' 的数据框(见下文),其中包含三年内每月 FID 的频率。
我想通过使用包 data.table 和 dplyr 计算 3 年内 FID 的总频率 来对我的数据框进行子集化, 然后我想总结我的数据,所以它包含:-
汇总数据框
- 月份
- 3 年来每月 FID 的总频率
- 3 年来每月 FID 的平均频率
- 3 年内每月 FID 的标准差
- 3 年内每月 FID 的标准误差
- 3 年内每月的上下置信水平
我不知道如何通过添加 3 年内每月 FID 的总频率来对数据框进行子集化。例如,在下面的这个例子中,三年一月份的总频率是 - 86 + 66 + 56 = 208,我想每个月都这样做。
###What a section of my data frame looks like ##Desired outcome
Year Month FID Month FID
2018 January 86 January 208
2019 January 66 February 176
2020 January 56
2018 February 76
2019 February 55
2020 February 45
汇总数据框的所需列
Month Total_FID Mean_FID SD_FID S.E_FID Ci_Lower Ci_Upper
我不是高级 R 用户,我已经在线阅读了许多 Stack Overflow 页面和教程,但我无法找出正确的程序(请参阅下面的 R 代码)来生成我想要的汇总数据框。我也找不到使用包 data.table 生成上下置信区间的方法。知道如何使用 data.table 和 dplyr 执行此过程真的很方便,因为我最常使用这两个包。
如果有人能提供帮助,我将不胜感激。
非常感谢
R-code
使用data.table
##Reformat into a data.table object
FID_Table<-data.table(FID)
##Summary statistics
FID.Summarised=FID_Table[, sum(FID),
Month=.N,
Mean_FID=mean(FID),
SD_FID=sd(FID),
S.E = std.error(FID),
by=Month]
##Error message
Error in `[.data.table`(FID_Table, , sum(FID), Month = .N, Mean_FID = mean(FID), :
unused arguments (Month = .N, Mean_FID = mean(FID), SD_FID = sd(FID), S.E = std.error(FID))
使用 dplyr
###Create a vector to ensure the dates are in the right order
library(tidyverse) #loads dplyr and ggplot2
month_levels = c('January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December')
p<-FID.Counts %>%
dplyr::mutate(Month=ordered(Month, levels=month_levels)) %>%
group_by(Month) %>%
dplyr::summarise(Mean.Month=mean(FID),
sd.month = sd(FID, na.rm = TRUE),
n_FID=n(),
sem=sd(FID)/sqrt(n()),
ci_low=Mean.Month-1.96*sem,
ci_hi=Mean.Month+1.96*sem) %>%
ungroup()
##The frequency of FID per month over three years is missing (see FID.Count below)
This code produces the summary statistics but does not subset the
data frame by the frequency counts of FID per month over the last 3
years.
#This table below summed the FID frequencies for each month using
this code using data.table.
library(data.table)
FID.Count=FID_Table[, sum(FID), by=(Month)]
Month V1
1: January 165
2: February 182
3: March 179
4: April 104
5: May 124
6: June 10
7: July 15
8: August 133
9: September 97
10: October 82
11: November 75
12: December 102
R 代码使用 dplyr
生成的汇总统计信息 Month Mean.Month sd.month n_FID sem ci_low ci_hi
1 January 55.000000 10.535654 3 6.082763 43.0777854 66.922215
2 February 60.666667 29.737743 3 17.169094 27.0152431 94.318090
3 March 59.666667 33.291641 3 19.220938 21.9936289 97.339704
4 April 34.666667 16.862186 3 9.735388 15.5853064 53.748027
5 May 41.333333 49.571497 3 28.620117 -14.7620965 97.428763
6 June 3.333333 5.773503 3 3.333333 -3.2000000 9.866667
7 July 5.000000 4.358899 3 2.516611 0.0674415 9.932558
8 August 44.333333 21.007935 3 12.128937 20.5606169 68.106050
9 September 32.333333 21.548395 3 12.440972 7.9490287 56.717638
10 October 27.333333 13.051181 3 7.535103 12.5645314 42.102135
11 November 25.000000 19.000000 3 10.969655 3.4994760 46.500524
12 December 34.000000 4.582576 3 2.645751 28.8143274 39.185673
数据框:'FID'
structure(list(Year = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L), Month = structure(c(5L, 4L, 8L, 1L, 9L,
7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L,
12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L,
10L, 3L), .Label = c("April", "August", "December", "February",
"January", "July", "June", "March", "May", "November", "October",
"September"), class = "factor"), FID = c(65L, 88L, 43L, 54L,
98L, 0L, 0L, 23L, 10L, 15L, 6L, 33L, 56L, 29L, 98L, 23L, 6L,
10L, 7L, 65L, 53L, 41L, 25L, 30L, 44L, 65L, 38L, 27L, 20L, 0L,
8L, 45L, 34L, 26L, 44L, 39L)), class = "data.frame", row.names = c(NA,
-36L))
答案:data.table
##Reformat into a data.table object
FID_Table<-data.table(FID)
##Summarise Data
Summarised.FID<-FID_Table[, .(FID.Freq=sum(FID),
mean = mean(FID),
sd=sd(FID),
median=median(FID)),
by = .(Month)]
##data.table results
Month FID.Freq mean sd median
1: January 165 55.000000 10.535654 56
2: February 182 60.666667 29.737743 65
3: March 179 59.666667 33.291641 43
4: April 104 34.666667 16.862186 27
5: May 124 41.333333 49.571497 20
6: June 10 3.333333 5.773503 0
7: July 15 5.000000 4.358899 7
8: August 133 44.333333 21.007935 45
9: September 97 32.333333 21.548395 34
10: October 82 27.333333 13.051181 26
11: November 75 25.000000 19.000000 25
12: December 102 34.000000 4.582576 33
答案:dplyr
#####Summarise Data###########
###Create a vector to ensure the dates are in the right order
library(tidyverse) #loads dplyr and ggplot2
month_levels = c('January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December')
##Summarise data
p<-FID.Counts %>%
dplyr::mutate(Month=ordered(Month, levels=month_levels)) %>%
group_by(Month) %>%
dplyr::summarise(Frequency = sum(FID),
Mean.Month=mean(FID),
sd.month = sd(FID, na.rm = TRUE),
n_FID=n(),
sem=sd(FID)/sqrt(n()),
ci_low=Mean.Month-1.96*sem,
ci_hi=Mean.Month+1.96*sem) %>%
ungroup()
##dplyr Results
Month Frequency Mean.Month sd.month n_FID sem ci_low ci_hi
1 January 165 55.000000 10.535654 3 6.082763 43.0777854 66.922215
2 February 182 60.666667 29.737743 3 17.169094 27.0152431 94.318090
3 March 179 59.666667 33.291641 3 19.220938 21.9936289 97.339704
4 April 104 34.666667 16.862186 3 9.735388 15.5853064 53.748027
5 May 124 41.333333 49.571497 3 28.620117 -14.7620965 97.428763
6 June 10 3.333333 5.773503 3 3.333333 -3.2000000 9.866667
7 July 15 5.000000 4.358899 3 2.516611 0.0674415 9.932558
8 August 133 44.333333 21.007935 3 12.128937 20.5606169 68.106050
9 September 97 32.333333 21.548395 3 12.440972 7.9490287 56.717638
10 October 82 27.333333 13.051181 3 7.535103 12.5645314 42.102135
11 November 75 25.000000 19.000000 3 10.969655 3.4994760 46.500524
12 December 102 34.000000 4.582576 3 2.645751 28.8143274 39.185673