Calculate the median date based on two grouping conditions

Question

I have the following data frame:

> head(df)
# A tibble: 6 x 6
# Groups:   lat, decade [2]
    lat  long date       year  decade    month_day
  <dbl> <dbl> <date>     <chr> <chr>     <chr>    
1    55    18 1952-02-03 1952  1950-1959 02-03    
2    55    18 1958-02-08 1958  1950-1959 02-08    
3    55    18 1958-02-08 1958  1950-1959 02-08    
4    55    18 1958-02-08 1958  1950-1959 02-08    
5    55    18 1965-02-07 1965  1960-1969 02-07    
6    55    18 1966-03-03 1966  1960-1969 03-03    
   


> summary(df)
          lat             long         date                year              decade         
     Min.   :55.00   Min.   :18   Min.   :1951-03-22   Length:1414        Length:1414       
     1st Qu.:56.00   1st Qu.:18   1st Qu.:1987-01-01   Class :character   Class :character  
     Median :58.00   Median :18   Median :2004-04-02   Mode  :character   Mode  :character  
     Mean   :59.07   Mean   :18   Mean   :1999-02-16                                        
     3rd Qu.:62.00   3rd Qu.:18   3rd Qu.:2014-01-01                                        
     Max.   :68.00   Max.   :18   Max.   :2021-03-28                                        
      month_day        
     Length:1414       
     Class :character  
     Mode  :character

I want to get the median month_day for each combination of latitude (lat) and decade.

I tried the following, but I can't get past this error:

df = df %>%
  group_by(lat, decade) %>%
  summarise(across(month_day, median)) %>%
  ungroup

Error in `summarise()`:
! Problem while computing `..1 = across(month_day, median)`.
Caused by error:
! `month_day` must return compatible vectors across groups.
i Result type for group 1 (lat = 55, decade = "1950-1959"): <double>.
i Result type for group 2 (lat = 55, decade = "1960-1969"): <character>.

I don't know how to solve this; any help is much appreciated.

Edit:

> ds_filtered_median[ds_filtered_median$lat == '57', ]
# A tibble: 124 x 6
     lat  long date       year  decade    month_day
   <dbl> <dbl> <date>     <chr> <chr>     <chr>    
 1    57    18 1955-04-08 1955  1950-1959 04-08    
 2    57    18 1957-02-19 1957  1950-1959 02-19    
 3    57    18 1958-04-06 1958  1950-1959 04-06    
 4    57    18 1959-01-01 1959  1950-1959 01-01    
 5    57    18 1960-01-03 1960  1960-1969 01-03    
 6    57    18 1961-01-02 1961  1960-1969 01-02    
 7    57    18 1962-01-02 1962  1960-1969 01-02    
 8    57    18 1963-01-01 1963  1960-1969 01-01    
 9    57    18 1964-01-19 1964  1960-1969 01-19    
10    57    18 1965-01-12 1965  1960-1969 01-12    
# ... with 114 more rows

Answer 1

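You have to convert month_day to a numeric value before you can take its median. across is only needed when you want to compute something separately for several columns, e.g. data %>% summarise(across(any_of(c("lat", "long")), median)) to get the median long and lat.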
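As background on the error in the question, here is a minimal base-R sketch of the likely cause. It assumes the groups differ in whether they contain an odd or an even number of rows; the vectors below are illustrative, not the asker's full data.

```r
# With an odd number of strings, median() returns the middle element unchanged
odd_group <- c("02-03", "02-08", "02-08")
odd_result <- median(odd_group)

# With an even number, median() averages the two middle elements; mean() of
# strings gives NA (a double) plus a warning, which is suppressed here
even_group <- c("02-07", "03-03")
even_result <- suppressWarnings(median(even_group))

class(odd_result)   # "character"
class(even_result)  # "numeric"
```

Because one group yields a character value and another a double, summarise() cannot combine them into a single column, which matches the "compatible vectors across groups" message.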

library(tidyverse)

data <- tribble(
  ~lat, ~long, ~date, ~year, ~decade, ~month_day,
  55, 18, "1952-02-03", 1952, "1950-1959", "02-03",
  55, 18, "1958-02-08", 1958, "1950-1959", "02-08",
  55, 18, "1958-02-08", 1958, "1950-1959", "02-08",
  55, 18, "1958-02-08", 1958, "1950-1959", "02-08",
  55, 18, "1965-02-07", 1965, "1960-1969", "02-07",
  55, 18, "1966-03-03", 1966, "1960-1969", "03-03"
)

data %>%
  mutate(
    # Keep only the trailing digits of month_day, i.e. the day of the month
    month_day_num = month_day %>% str_extract("[0-9]+$") %>% as.numeric()
  ) %>%
  group_by(lat, decade) %>%
  summarise(
    median_month_day = median(month_day_num)
  )
#> `summarise()` has grouped output by 'lat'. You can override using the `.groups`
#> argument.
#> # A tibble: 2 × 3
#> # Groups:   lat [1]
#>     lat decade    median_month_day
#>   <dbl> <chr>                <dbl>
#> 1    55 1950-1959                8
#> 2    55 1960-1969                5

Created on 2022-04-05 by the reprex package (v2.0.0)
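Note that the regular expression above keeps only the day of the month, so "02-07" and "03-03" become 7 and 3. If the month should count as well, one possible variant is to convert the whole month_day into a day of the year. The sketch below assumes 2000 as an arbitrary reference year; ref_date and day_of_year are hypothetical helper columns, and the data frame repeats only the columns needed from the sample rows.

```r
library(dplyr)

# Sample rows from the question (only the columns needed here)
data <- data.frame(
  lat = 55,
  decade = c("1950-1959", "1950-1959", "1950-1959", "1950-1959",
             "1960-1969", "1960-1969"),
  month_day = c("02-03", "02-08", "02-08", "02-08", "02-07", "03-03")
)

result <- data %>%
  mutate(
    # Assumption: attach the reference year 2000 so the string parses as a Date
    ref_date = as.Date(paste0("2000-", month_day)),
    # "%j" is the day of the year (001-366)
    day_of_year = as.integer(format(ref_date, "%j"))
  ) %>%
  group_by(lat, decade) %>%
  summarise(median_day_of_year = median(day_of_year), .groups = "drop")

result
```

With these six rows the medians are 39 for 1950-1959 and 50.5 for 1960-1969, counted as days of the reference year.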

Answer 2

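You can convert each date to the number of days since the start of its year. From that number you can easily compute the median. Then convert the median back into a date by adding it to January 1 of some reference year. You may want to make that a leap year, though... For the date manipulation I used lubridate.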

library(dplyr)
library(lubridate)

data %>%
  mutate(
    date = ymd(date),
    # Days elapsed since January 1 of the same year (0 for January 1)
    days_since_january = as.numeric(date - ymd(paste(year(date), 1, 1, sep = "-")))
  ) %>%
  group_by(lat, decade) %>%
  summarise(across(days_since_january, median), .groups = "keep") %>%
  # Map the median offset back onto the reference year 1960 and keep month-day
  mutate(
    median_month_date = format(
      ymd("1960-01-01") + days(floor(days_since_january)), "%m-%d"
    )
  )


# A tibble: 2 x 4
# Groups:   lat, decade [2]
    lat decade    days_since_january median_month_date
  <dbl> <chr>                  <dbl> <chr>            
1    55 1950-1959                 38 02-08            
2    55 1960-1969                 49 02-19  

For the lat = 57 rows (as in the question's edit), the output is:

# A tibble: 2 x 4
# Groups:   lat, decade [2]
    lat decade    days_since_january median_month_date
  <int> <chr>                  <dbl> <chr>            
1    57 1950-1959               72   03-13            
2    57 1960-1969                1.5 01-02
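To illustrate the leap-year caveat, here is a small base-R sketch. It shows that an offset taken from a non-leap year can land one day earlier when mapped onto the leap year 1960, and then tries an alternative: putting every month_day into one reference year before taking the median. The reference year 2000 and all variable names are assumptions made for illustration, not part of the answer above.

```r
# Offset of 3 March from 1 January in the non-leap year 1966
offset_1966 <- as.numeric(as.Date("1966-03-03") - as.Date("1966-01-01"))  # 61

# Mapped onto the leap year 1960, the same offset gives 2 March
mapped <- format(as.Date("1960-01-01") + offset_1966, "%m-%d")  # "03-02"

# Alternative sketch: parse every month_day in one assumed reference year,
# then take the median of the resulting Date values directly
month_day <- c("02-07", "03-03")
ref_dates <- as.Date(paste0("2000-", month_day))
median_md <- format(median(ref_dates), "%m-%d")  # "02-19"
```

Since median() works on Date vectors and always returns a Date, the same expression could also be used inside summarise() after group_by(lat, decade) without running into mixed result types.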
