How to summarise daily data by month, using dplyr and lubridate, only if fewer than 10 days per month are NA?

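I have daily meteorological data (temperature and precipitation) from 1955 to 2017 for different locations, and I want to summarise each variable as a monthly mean, but only if the number of NAs in each month is less than 10.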

As an example, I use four months of temperature data (month 1: 1 NA; month 2 (31 days): 30 NAs; month 3: 0 NAs; month 4: all values are NA):

library(dplyr)
library(lubridate)    
exmpldf <- data.frame(DATE = c("1955-06-01", "1955-06-02", "1955-06-03", "1955-06-04", "1955-06-05", "1955-06-06", "1955-06-07", "1955-06-08", "1955-06-09", "1955-06-10", 
                                    "1955-06-11", "1955-06-12", "1955-06-13", "1955-06-14", "1955-06-15", "1955-06-16", "1955-06-17", "1955-06-18", "1955-06-19", "1955-06-20", 
                                    "1955-06-21", "1955-06-22", "1955-06-23", "1955-06-24", "1955-06-25", "1955-06-26", "1955-06-27", "1955-06-28", "1955-06-29", "1955-06-30", 
                                    "1955-07-01", "1955-07-02", "1955-07-03", "1955-07-04", "1955-07-05", "1955-07-06", "1955-07-07", "1955-07-08", "1955-07-09", "1955-07-10", 
                                    "1955-07-11", "1955-07-12", "1955-07-13", "1955-07-14", "1955-07-15", "1955-07-16", "1955-07-17", "1955-07-18", "1955-07-19", "1955-07-20", 
                                    "1955-07-21", "1955-07-22", "1955-07-23", "1955-07-24", "1955-07-25", "1955-07-26", "1955-07-27", "1955-07-28", "1955-07-29", "1955-07-30", 
                                    "1955-07-31", "1955-08-01", "1955-08-02", "1955-08-03", "1955-08-04", "1955-08-05", "1955-08-06", "1955-08-07", "1955-08-08", "1955-08-09", 
                                    "1955-08-10", "1955-08-11", "1955-08-12", "1955-08-13", "1955-08-14", "1955-08-15", "1955-08-16", "1955-08-17", "1955-08-18", "1955-08-19", 
                                    "1955-08-20", "1955-08-21", "1955-08-22", "1955-08-23", "1955-08-24", "1955-08-25", "1955-08-26", "1955-08-27", "1955-08-28", "1955-08-29", 
                                    "1955-08-30", "1955-08-31", "1955-09-01", "1955-09-02", "1955-09-03", "1955-09-04", "1955-09-05", "1955-09-06", "1955-09-07", "1955-09-08", 
                                    "1955-09-09", "1955-09-10", "1955-09-11", "1955-09-12", "1955-09-13", "1955-09-14", "1955-09-15", "1955-09-16", "1955-09-17", "1955-09-18", 
                                    "1955-09-19", "1955-09-20", "1955-09-21", "1955-09-22", "1955-09-23", "1955-09-24", "1955-09-25", "1955-09-26", "1955-09-27", "1955-09-28", 
                                    "1955-09-29", "1955-09-30"), 
                          TMAX = c(NA, 20, 27, 17,  26.5, 27, 17, 26.5, 20, 23, 23, 21.5, 24, 26.5, 27, 27, 26.5, 24.5, 23, 22.5, 24, 23, 21.5, 25, 26.5, 23, 
                           24, 23.5, 23, 23, 23, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
                           NA, 24, 22, 21, 17, 17, 17, 21.5, 22, 22, 22.5, 22.5, 16.5, 20.5, 17.5, 23, 17, 21, 21.5, 21, 21, 20, 22, 22, 22, 21.5, 21.5, 21.5, 22.5, 20, 
                           21, 20, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA))

For the monthly summary, I used mutate to create a column "MONTH" and a column "YEAR":

exmpldf <- exmpldf %>%
  mutate(month(DATE), year(DATE))
names(exmpldf) <- c("DATE", "TMAX", "MONTH", "YEAR")

To create the monthly means I used summarise:

exmpldfmeanMonth <- exmpldf %>%
  group_by(MONTH, YEAR) %>%
  summarise(TMAX = mean(TMAX))

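The problem is that in my time series (1955-2017) many months have at least one daily value that is NA, while other months have all or almost all daily values as NA. In either case, the monthly mean is NA: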

> exmpldfmeanMonth
# A tibble: 4 x 3
# Groups:   MONTH [4]
  MONTH  YEAR  TMAX
  <dbl> <dbl> <dbl>
1     6  1955  NA   (1 day is NA)
2     7  1955  NA   (all days but 1, are NA)
3     8  1955  20.7 (no NAs)
4     9  1955  NA   (all days are NA)

You can add na.rm = T, but then the mean is calculated even if a month has only a single value:

exmpldfmeanMonth <- exmpldf %>%
  group_by(MONTH, YEAR) %>%
  summarise(TMAX = mean(TMAX, na.rm = T))

> exmpldfmeanMonth
# A tibble: 4 x 3
# Groups:   MONTH [4]
  MONTH  YEAR  TMAX
  <dbl> <dbl> <dbl>
1     6  1955  23.7  (1 day is NA)
2     7  1955  23    (all days but 1, are NA)
3     8  1955  20.7  (no NAs)
4     9  1955 NaN    (all days are NA)
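To see which months would pass a missing-data rule before averaging anything, it can help to count the NAs per month first. The following is a minimal sketch on toy data (not the series above); the object names `toy` and `na_counts` and the columns `n_days` and `n_na` are made up for illustration, and DATE is stored as a Date rather than as character.

```r
library(dplyr)
library(lubridate)

# Toy data (assumed for illustration): June 1955 with 1 NA,
# July 1955 with a single value followed by 30 NAs.
toy <- data.frame(
  DATE = seq(as.Date("1955-06-01"), as.Date("1955-07-31"), by = "day"),
  TMAX = c(NA, rep(22, 29), 23, rep(NA, 30))
)

# Count the days and the missing values in each year-month group.
na_counts <- toy %>%
  group_by(YEAR = year(DATE), MONTH = month(DATE)) %>%
  summarise(n_days = n(),
            n_na   = sum(is.na(TMAX)),
            .groups = "drop")

na_counts
```

A table like this makes it easy to check how many months a given threshold would discard.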

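So I would like to add a condition so that the monthly mean is only calculated when a month has 10 or fewer NAs; otherwise the result should be NA: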

> exmpldfmeanMonth
# A tibble: 4 x 3
# Groups:   MONTH [4]
  MONTH  YEAR  TMAX
  <dbl> <dbl> <dbl>
1     6  1955  23.7  (1 day is NA)
2     7  1955  NA    (all days but 1, are NA)
3     8  1955  20.7  (no NAs)
4     9  1955 NaN    (all days are NA)

Could you guide me on how to solve this? Thank you very much!

library(dplyr)
library(lubridate)

df %>% 
  mutate(month = month(DATE),
         year = year(DATE)) %>% 
  group_by(month, year) %>% 
  summarize(prcp = if (sum(is.na(TMAX)) <= 10) mean(TMAX, na.rm = T) else NA,
            .groups = "drop")

Alternatively, when you summarize, you can count the number of NAs and then add a mutate statement to conditionally change prcp:

df %>% 
  mutate(month = month(DATE),
         year = year(DATE)) %>% 
  group_by(month, year) %>% 
  summarize(prcp = mean(TMAX, na.rm = T),
            numna = sum(is.na(TMAX)), # count number of NA
            .groups = "drop") %>% 
  mutate(prcp = ifelse(numna > 10, NA, prcp)) %>% 
  select(-numna)

Output

In the data you showed, there is only one month-year combination, and that group has more than 10 NAs:

  month  year prcp 
1     6  1955 NA   

Update

Given that you have updated your reprex with new data, this solution still works:

str(exmpldf)
'data.frame':   122 obs. of  2 variables:
 $ DATE: chr  "1955-06-01" "1955-06-02" "1955-06-03" "1955-06-04" ...
 $ TMAX: num  NA 20 27 17 26.5 27 17 26.5 20 23 ...

exmpldf %>% 
  mutate(month = month(DATE),
         year = year(DATE)) %>% 
  group_by(month, year) %>% 
  summarize(prcp = if (sum(is.na(TMAX)) <= 10) mean(TMAX, na.rm = T) else NA,
            .groups = "drop")

  month  year  prcp
  <dbl> <dbl> <dbl>
1     6  1955  23.7
2     7  1955  NA  
3     8  1955  20.7
4     9  1955  NA  
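The question's title asks for "less than 10" NAs while its body asks for "10 or fewer", and the two differ only for a month with exactly 10 NAs. As a rough sketch on toy data (not the original series; the column names `n_na`, `mean_le_10` and `mean_lt_10` are made up for illustration), both thresholds can be compared side by side:

```r
library(dplyr)
library(lubridate)

# Toy data (assumed for illustration): June 1955 with exactly 10 NAs.
toy <- data.frame(
  DATE = seq(as.Date("1955-06-01"), as.Date("1955-06-30"), by = "day"),
  TMAX = c(rep(NA, 10), rep(21, 20))
)

res <- toy %>%
  group_by(year = year(DATE), month = month(DATE)) %>%
  summarise(
    n_na = sum(is.na(TMAX)),
    # "10 or fewer NAs": a month with exactly 10 NAs is kept
    mean_le_10 = if (n_na <= 10) mean(TMAX, na.rm = TRUE) else NA_real_,
    # "fewer than 10 NAs": the same month is set to NA
    mean_lt_10 = if (n_na < 10) mean(TMAX, na.rm = TRUE) else NA_real_,
    .groups = "drop"
  )

res
```

Using `NA_real_` rather than a bare `NA` keeps the result column numeric even when every group falls in the `else` branch.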

Please find below an alternative approach using the packages data.table and lubridate:

Reprex

  • Code
library(data.table)
library(lubridate)

setDT(df1)[, DATE := ymd(DATE)
           ][, `:=` (month = month(DATE), year = year(DATE))
             ][, .(PRCP = fifelse(sum(is.na(TMAX)) <= 10, mean(TMAX, na.rm = TRUE), NA_real_)), by = .(month, year)][]
  • Case 1: number of NAs <= 10

1.1 Your data:

df1 <- data.frame(DATE = c("1955-06-01", "1955-06-02", "1955-06-03", "1955-06-04",
                           "1955-06-05", "1955-06-06", "1955-06-07", "1955-06-08",
                           "1955-06-09", "1955-06-10", "1955-06-11", "1955-06-12",
                           "1955-06-13", "1955-06-14", "1955-06-15", "1955-06-16"),
                  TMAX = c(NA, NA, NA, NA, NA, NA, NA, NA, 20, 23, 23, 21.5, 24, 26.5,
                           27, 27))

1.2 Output:

#>    month year PRCP
#> 1:     6 1955   24
  • Case 2: number of NAs > 10

2.1 Your data:

df1 <- data.frame(DATE = c("1955-06-01", "1955-06-02", "1955-06-03", "1955-06-04",
                           "1955-06-05", "1955-06-06", "1955-06-07", "1955-06-08",
                           "1955-06-09", "1955-06-10", "1955-06-11", "1955-06-12",
                           "1955-06-13", "1955-06-14", "1955-06-15", "1955-06-16"),
                  TMAX = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 21.5, 24, 26.5,
                           27, 27))

2.2 Output:

#>    month year PRCP
#> 1:     6 1955   NA

Created on 2021-10-28 by the reprex package (v0.3.0)

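Consider creating a helper function that you can customise as needed. In addition, you can specify whether you want to use mean, sum, or any other aggregation.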

#' Aggregate a vector, returning NA when too many values are missing
#' @param x a numeric vector of values
#' @param n the maximum number of NA values allowed
#' @param f which function to apply (`avg` or `sum`)
agg_data <- function(x, n = 10, f = 'avg'){

  # return NA if there are more than n NA values
  if( sum(is.na(x)) > n ) return( NA_real_ )

  # apply the requested aggregation; any other value of f gives NA
  x <- dplyr::case_when(
    f %in% 'avg' ~ mean(x, na.rm = TRUE),
    f %in% 'sum' ~ sum(x, na.rm = TRUE),
    TRUE ~ NA_real_
  )

  return( x )
}

You can then use this function in your summarise call, for example:

exmpldf %>% 
  mutate(month = month(DATE),
         year = year(DATE)) %>% 
  group_by(month, year) %>% 
  summarise(prcp = agg_data(TMAX, n = 10, f = 'avg'),
            .groups = "drop")