Selecting years with both summer and winter

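I have a data set that covers summers and winters over multiple years. I just realised that when I subset it by winter and summer, I end up with more winters than summers. I think the issue is that my data either ends at the beginning of a summer or ends at the end of a winter.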

Is there an argument I can set so that I only select years that have both a summer and a winter?

library(lubridate)
library(tidyverse)

date <- rep_len(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"),1000)
ID <- rep(seq(1, 5), 100)

df <- data.frame(date = date,
                 x = runif(length(date), min = 60000, max = 80000),
                 y = runif(length(date), min = 800000, max = 900000),
                 ID)

df$month <- month(df$date)
df$year <- year(df$date)

df1 <- df %>%
  mutate(season_categ = case_when(month %in% 6:8 ~ 'summer',
                                  month %in% 1:3 ~ 'winter')) %>%
  group_by(ID, year, season_categ)

summer_list <- df1 %>% 
  group_by(ID, year)%>% 
  filter(season_categ == "summer") %>% 
  group_split()

winter_list <- df1 %>% 
  group_by(ID, year) %>% 
  filter(season_categ == "winter") %>% 
  group_split()

This should do it:

df1 %>% 
  group_by(year) %>% 
  filter(any(season_categ == "winter") & 
           any(season_categ == "summer"))
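
A side note on the NA rows: season_categ is NA for months outside the two seasons, so for a year with no winter rows any(season_categ == "winter") evaluates to NA rather than FALSE. filter() drops groups whose condition is NA, so the result is the same, but a variant based on %in% gives a plain TRUE/FALSE. A minimal sketch on made-up data (the toy tibble and its values are hypothetical, only for illustration):

```r
library(dplyr)

# Hypothetical toy data: 2010 has summer rows only, 2011 has both seasons.
toy <- tibble(
  year         = c(2010, 2010, 2011, 2011, 2011),
  season_categ = c("summer", NA, "winter", "summer", NA)
)

both <- toy %>%
  group_by(year) %>%
  # %in% never returns NA, so the condition is strictly TRUE or FALSE per year
  filter(all(c("winter", "summer") %in% season_categ)) %>%
  ungroup()

both
```

Only the 2011 rows should remain, including the NA-season row of that year.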

To test it, we can first remove the winter months from one year (2010, for example) to get an incomplete year:

df1 %>% 
  filter(!(year == 2010 & season_categ == "winter")) %>% 
  group_by(year) %>% 
  filter(any(season_categ == "winter") &
           any(season_categ == "summer"))
#> # A tibble: 635 x 7
#> # Groups:   year [2]
#>    date            x       y    ID month  year season_categ
#>    <date>      <dbl>   <dbl> <int> <dbl> <dbl> <chr>       
#>  1 2011-01-01 69169. 880856.     1     1  2011 winter      
#>  2 2011-01-02 62891. 869748.     2     1  2011 winter      
#>  3 2011-01-03 64951. 851220.     3     1  2011 winter      
#>  4 2011-01-04 77424. 844041.     4     1  2011 winter      
#>  5 2011-01-05 75827. 861533.     5     1  2011 winter      
#>  6 2011-01-06 72937. 830014.     1     1  2011 winter      
#>  7 2011-01-07 60130. 830369.     2     1  2011 winter      
#>  8 2011-01-08 79719. 812852.     3     1  2011 winter      
#>  9 2011-01-09 60300. 845120.     4     1  2011 winter      
#> 10 2011-01-10 62817. 879759.     5     1  2011 winter      
#> # … with 625 more rows

This is identical to df1 %>% filter(year != 2010) (for my dates), which means it works.
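
That equivalence can also be checked in code rather than by eye. A sketch, assuming the same date range as in the question; the tibble d is a stripped-down stand-in for df1 (the name and the reduced column set are mine):

```r
library(dplyr)
library(lubridate)

# Same 1000 consecutive days as in the question, without x, y and ID
dates <- seq(dmy("01-01-2010"), by = "days", length.out = 1000)
d <- tibble(date = dates, month = month(dates), year = year(dates)) %>%
  mutate(season_categ = case_when(month %in% 6:8 ~ "summer",
                                  month %in% 1:3 ~ "winter"))

filtered <- d %>%
  # drops the 2010 winter rows; the 2010 rows with an NA season go too,
  # because the condition is NA for them
  filter(!(year == 2010 & season_categ == "winter")) %>%
  group_by(year) %>%
  filter(any(season_categ == "winter") & any(season_categ == "summer")) %>%
  ungroup()

expected <- d %>% filter(year != 2010)

identical(filtered$date, expected$date)
```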

I gave it a try.

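  • A year that has both seasons will have a mean month (computed per year over the season rows) greater than 3 but less than 6, given the way you created that column.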
# Keep the summer rows of each ID-year whose mean season month is strictly between 3 and 6
df1 %>% 
  group_by(ID, year) %>% 
  filter(season_categ == "summer" &
         mean(month[season_categ %in% c('summer', 'winter')]) > 3 &
         mean(month[season_categ %in% c('summer', 'winter')]) < 6)
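
The arithmetic behind that threshold, as a small worked example. The month vectors are made up for illustration; note that the mean is weighted by the number of rows, so this is a heuristic rather than a strict test:

```r
# Mean month when every season month contributes equally
both_seasons <- mean(c(1, 2, 3, 6, 7, 8))  # 4.5, inside (3, 6)
winter_only  <- mean(c(1, 2, 3))           # 2, not above 3
summer_only  <- mean(c(6, 7, 8))           # 7, not below 6

# A lopsided year can fall outside (3, 6) even though both seasons
# are present: one January row plus nine July rows.
lopsided <- mean(c(1, rep(7, 9)))          # 6.4
```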

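This should give you the desired result. But perhaps the problem lies elsewhere: you might have a different number of months per season, e.g. months 1, 2, 6, 7, 8. Such a year has both seasons, yet the two subsets contain different numbers of rows.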
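
To see whether that is what is happening, it may help to count the distinct months and rows per season and year. A sketch on hypothetical data (the toy tibble and the coverage name are mine, not from the question):

```r
library(dplyr)

# Hypothetical year with two winter months and three summer months
toy <- tibble(
  year         = rep(2010, 5),
  month        = c(1, 2, 6, 7, 8),
  season_categ = c("winter", "winter", "summer", "summer", "summer")
)

coverage <- toy %>%
  group_by(year, season_categ) %>%
  summarise(n_months = n_distinct(month), n_rows = n(), .groups = "drop")

coverage
```

Running the same summary on df1 (after dropping the NA seasons) should show which years or IDs are unbalanced.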

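Part of this depends on whether you are using meteorological or astronomical seasons (https://www.almanac.com/content/first-day-seasons), or whether you want a season to start on a particular day rather than at a month boundary. Here is a suggestion that allows for that.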

seasons <- as.Date(paste0("2021-", c("01-01", "03-01", "06-01", "09-01", "12-01")))
seasons <- as.POSIXlt(seasons)$yday
seasons <- setNames(seasons, c("winter", "spring", "summer", "fall", "winter"))
seasons
# winter spring summer   fall winter 
#      0     59    151    243    334 
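
One thing to watch when combining these boundaries with lubridate: as.POSIXlt()$yday counts from 0, while lubridate::yday() counts from 1, so mixing the two shifts every boundary by one day. A sketch of the difference, with the boundaries rebuilt on the 1-based scale (seasons0, seasons1 and the label variables are names I made up for the comparison):

```r
library(lubridate)

d <- as.Date("2021-03-01")
as.POSIXlt(d)$yday  # 59 (0-based)
yday(d)             # 60 (1-based)

starts <- as.Date(paste0("2021-", c("01-01", "03-01", "06-01", "09-01", "12-01")))
season_names <- c("winter", "spring", "summer", "fall", "winter")

seasons0 <- setNames(as.POSIXlt(starts)$yday, season_names)  # 0-based boundaries
seasons1 <- setNames(yday(starts), season_names)             # 1-based boundaries

# 28 February has a 1-based day of year of 59
feb28 <- yday(as.Date("2021-02-28"))
old_label <- names(seasons0)[findInterval(feb28, seasons0)]  # "spring"
new_label <- names(seasons1)[findInterval(feb28, seasons1)]  # "winter"
```

Either scale works as long as the boundaries and the data use the same one. The boundaries come from a non-leap year (2021), so in leap years dates after 29 February are still off by one day in both versions.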

library(dplyr)
library(lubridate)
as_tibble(df) %>%
  mutate(
    yday = yday(date),
    season = names(seasons)[findInterval(yday, c(seasons, Inf))]
  ) %>%
  sample_n(10)
# # A tibble: 10 x 6
#    date            x       y    ID  yday season
#    <date>      <dbl>   <dbl> <int> <int> <chr> 
#  1 2010-02-13 79471. 862415.     4    43 winter
#  2 2010-10-23 61796. 813429.     1   295 fall  
#  3 2011-12-09 65958. 808064.     3   342 winter
#  4 2010-05-29 65872. 841309.     4   148 spring
#  5 2010-03-07 63789. 869548.     1    65 spring
#  6 2012-02-12 67605. 859081.     3    42 winter
#  7 2011-04-19 79034. 883803.     4   108 spring
#  8 2011-03-16 69658. 832297.     5    74 spring
#  9 2011-12-29 68793. 881267.     3   362 winter
# 10 2012-04-24 70784. 805323.     5   114 spring

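From here, let's filter for years that have at least one row from each of the two seasons. Unfortunately, the sample data is spread fairly evenly across the seasons: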

as_tibble(df) %>%
  mutate(
    year = year(date),
    yday = yday(date),
    season = names(seasons)[findInterval(yday, c(seasons, Inf))],
    ) %>%
  group_by(year) %>%
  count(season) %>%
  tidyr::pivot_wider(id_cols = year, names_from = season, values_from = n)
# # A tibble: 3 x 5
# # Groups:   year [3]
#    year  fall spring summer winter
#   <dbl> <int>  <int>  <int>  <int>
# 1  2010    91     92     92     90
# 2  2011    91     92     92     90
# 3  2012    28     92     92     58

As a result, we cannot tell whether the filter is really doing what we want. I'll artificially remove some data to test the logic:

as_tibble(df) %>%
  mutate(
    year = year(date),
    yday = yday(date),
    season = names(seasons)[findInterval(yday, c(seasons, Inf))],
    ) %>%
  filter(year > 2010 | season %in% c("fall", "winter")) %>%  # artificial, for testing
  group_by(year) %>%
  count(season) %>%
  tidyr::pivot_wider(id_cols = year, names_from = season, values_from = n)
# # A tibble: 3 x 5
# # Groups:   year [3]
#    year  fall winter spring summer
#   <dbl> <int>  <int>  <int>  <int>
# 1  2010    91     90     NA     NA
# 2  2011    91     90     92     92
# 3  2012    28     58     92     92

From here, we add a grouped filter:

as_tibble(df) %>%
  mutate(
    year = year(date),
    yday = yday(date),
    season = names(seasons)[findInterval(yday, c(seasons, Inf))],
    ) %>%
  filter(year > 2010 | season %in% c("fall", "winter")) %>%  # artificial, for testing
  group_by(year) %>%
  filter(all(c("winter", "summer") %in% season)) %>%         # this is the new line
  sample_n(10)
# # A tibble: 20 x 7
# # Groups:   year [2]
#    date            x       y    ID  year  yday season
#    <date>      <dbl>   <dbl> <int> <dbl> <dbl> <chr> 
#  1 2011-02-13 61686. 815664.     4  2011    44 winter
#  2 2011-10-23 75448. 849477.     1  2011   296 fall  
#  3 2011-07-15 75901. 840969.     1  2011   196 summer
#  4 2011-05-29 66108. 811565.     4  2011   149 spring
#  5 2011-03-07 70298. 831304.     1  2011    66 spring
#  6 2011-09-18 73951. 875712.     1  2011   261 fall  
#  7 2011-08-04 64917. 860239.     1  2011   216 summer
#  8 2011-11-29 78909. 802692.     3  2011   333 fall  
#  9 2011-01-07 66441. 868062.     2  2011     7 winter
# 10 2011-06-16 64583. 889124.     2  2011   167 summer
# 11 2012-05-09 78725. 862934.     5  2012   130 spring
# 12 2012-08-12 67767. 871229.     5  2012   225 summer
# 13 2012-06-28 62354. 898829.     5  2012   180 summer
# 14 2012-05-26 62373. 819059.     2  2012   147 spring
# 15 2012-06-21 68019. 896370.     3  2012   173 summer
# 16 2012-01-22 61753. 872778.     2  2012    22 winter
# 17 2012-03-18 64490. 810292.     3  2012    78 spring
# 18 2012-08-15 76048. 875765.     3  2012   228 summer
# 19 2012-09-15 65386. 885431.     4  2012   259 fall  
# 20 2012-04-19 60072. 895292.     5  2012   110 spring

(2010 is absent, as expected.)
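
To tie this back to the original goal of equally long summer and winter lists, here is a sketch that applies the same kind of grouped filter per ID and year before splitting. It uses the month-based seasons from the question, and it drops the 2010 winter rows artificially to create an incomplete year; the object names complete, summer_list and winter_list are illustrative:

```r
library(dplyr)
library(lubridate)

dates <- rep_len(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"), 1000)
df <- data.frame(date = dates, ID = rep(seq(1, 5), 200))

complete <- df %>%
  mutate(year = year(date), month = month(date),
         season_categ = case_when(month %in% 6:8 ~ "summer",
                                  month %in% 1:3 ~ "winter")) %>%
  filter(!is.na(season_categ)) %>%              # keep season rows only
  filter(!(year == 2010 & month %in% 1:3)) %>%  # artificial, for testing
  group_by(ID, year) %>%
  filter(all(c("winter", "summer") %in% season_categ)) %>%
  ungroup()

summer_list <- complete %>% filter(season_categ == "summer") %>% group_split(ID, year)
winter_list <- complete %>% filter(season_categ == "winter") %>% group_split(ID, year)

length(summer_list) == length(winter_list)
```

Because every remaining ID-year contains both seasons, the two lists should have one element per ID-year each.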


Data

I set a seed on the random data for reproducibility:

set.seed(42)
date <- rep_len(seq(dmy("01-01-2010"), dmy("31-12-2013"), by = "days"),1000)
ID <- rep(seq(1, 5), 100)

df <- data.frame(date = date,
                 x = runif(length(date), min = 60000, max = 80000),
                 y = runif(length(date), min = 800000, max = 900000),
                 ID)

head(df)
#         date          x          y ID
# 1 2010-01-01 78296.1209 884829.322  1
# 2 2010-01-02 78741.5083 806274.633  2
# 3 2010-01-03 65722.7907 881984.509  3
# 4 2010-01-04 76608.9525 853936.029  4
# 5 2010-01-05 72834.9104 849902.010  5
# 6 2010-01-06 70381.9190 802222.732  1