Computing onset date of snowmelt in R

Question

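I have daily temperature data from 1950 to 2017 in this format.

I need to compute the snowmelt onset date, defined as the first day with a daily temperature above 0 °C following the last five-day period between March and May in which the daily temperature was below 0 °C. My code so far: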

  df1<-read.csv("temp.csv")
  require(dplyr)
  # applying the condition to check each temperature value
  df1$boolean<- ifelse(df1$temp<0.0 , 1, 0)

  #computing the total sum < 0 and the start and end date
  snow<-df1 %>%
  mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
  group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
   filter(boolean == 1 & n() > 1) %>%
   summarize("Start Date"=min(as.character(date)),
        "End Date"=max(as.character(date)),
        "Length of Run"=n()) %>%
   ungroup() %>%
  select(-matches("group"))
colnames(snow)[3] <- 'length'

# keep only runs of at least 5 days
obs<-subset(snow,length >=5)

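The code above gives me a partial solution (with further manual editing I would get a result that matches my definition). I am only interested in one onset date per year. I need some guidance on how to edit this code so that it computes the onset date according to the definition above.

I have many locations, so editing the results by hand is not a workable solution. Any help would be appreciated.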

Answer 1

rle() to the rescue!

library(broom)
library(tidyverse)

temp <- read_csv("temp.csv")

It is best to read the pipeline further down before reading this helper function.

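For each year we:

take the run-length encoding of above/below 0
treat the first run that is TRUE (< 0) and lasts 5 or more days as our candidate
take the index of the day right after that run
if that index is past the end of the data (no qualifying day), return NA
else return that day

So: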

mk_runs <- function(xdf) {

  r <- rle(xdf$below_0) # take the T/F RLE
  pos <- which(r$values & r$lengths >= 5)[1] # find the first run meeting the criteria
  # sum the run lengths up to this point and add 1 to reach the first day that is not below 0
  idx <- if (is.na(pos)) NA else sum(r$lengths[1:pos]) + 1

  if (is.na(idx) || idx > nrow(xdf)) { # no qualifying run, or past our date range: return NA
    tibble(year = xdf$year[1], date = as.Date(NA))
  } else {
    xdf[idx, c("year", "date")]
  }

}
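To make the run-length steps concrete, here is a minimal sketch on a made-up logical vector. The values and the `pos_last` variant are illustrative assumptions, not part of the original data or helper. It shows what `rle()` returns and how `pos` and `idx` follow from it.

```r
# Toy data (made up): TRUE means the day is below 0
below_0 <- c(rep(TRUE, 5), FALSE, TRUE, TRUE, FALSE, rep(TRUE, 6), FALSE, FALSE)

r <- rle(below_0)
# r$lengths: 5 1 2 1 6 2
# r$values:  TRUE FALSE TRUE FALSE TRUE FALSE

# Runs that are below 0 and last at least 5 days
qualifying <- which(r$values & r$lengths >= 5)

# First qualifying run, as in mk_runs(): the day after it is the candidate
pos_first <- qualifying[1]
idx_first <- sum(r$lengths[1:pos_first]) + 1

# Hypothetical variant: last qualifying run instead of the first
pos_last <- tail(qualifying, 1)
idx_last <- sum(r$lengths[1:pos_last]) + 1

c(first = idx_first, last = idx_last)
```

Picking the last qualifying run with `tail(..., 1)` may be closer to the "last five-day period" wording in the question; whether that is what you want depends on how you read the definition.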

We need to tidy the data first:

separate(temp, Date, c("month", "day", "year")) %>%
  mutate_all(as.numeric) %>% 
  mutate(year = ifelse(year >=50, 1900+year, 2000+year)) %>% 
  mutate(date = as.Date(sprintf("%04d-%02d-%02d", year, month, day))) %>% 
  mutate(month = lubridate::month(date)) %>% 
  mutate(below_0 = temp < 0) %>% 
  filter(month >= 3 & month <=5) %>% 
  group_by(year) %>% # year groups
  arrange(date) %>%  # in order
  do(mk_runs(.)) %>% # see above function
  print(n=21)
## # A tibble: 21 x 2
## # Groups:   year [21]
##     year       date
##    <dbl>     <date>
##  1  1950 1950-04-30
##  2  1951 1951-05-21
##  3  1952 1952-05-28
##  4  1953 1953-05-15
##  5  1954 1954-05-28
##  6  1955 1955-05-14
##  7  1956 1956-05-02
##  8  1957 1957-05-07
##  9  1958 1958-04-27
## 10  1959         NA
## 11  1960 1960-04-24
## 12  1961 1961-05-16
## 13  1962 1962-05-19
## 14  1963 1963-05-13
## 15  1964 1964-05-20
## 16  1965 1965-05-20
## 17  1966 1966-05-07
## 18  1967 1967-04-27
## 19  1968 1968-05-10
## 20  1969 1969-05-22
## 21  1970 1970-05-21
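The `ifelse(year >= 50, ...)` step in the pipeline is there because the years are stored as two digits. Below is a minimal base R sketch of the same pivot, assuming the `Date` column holds strings such as "4/30/50" with "/" as the separator (the source does not show the exact format). The function name `parse_mdy_pivot50` is hypothetical.

```r
# Hypothetical helper: parse "m/d/yy" strings, mapping yy >= 50 to 19yy
# and yy < 50 to 20yy (the same rule as in the pipeline above).
parse_mdy_pivot50 <- function(x) {
  parts <- do.call(rbind, strsplit(x, "/", fixed = TRUE))
  m <- as.integer(parts[, 1])
  d <- as.integer(parts[, 2])
  y <- as.integer(parts[, 3])
  y <- ifelse(y >= 50, 1900L + y, 2000L + y)
  as.Date(sprintf("%04d-%02d-%02d", y, m, d))
}

parsed <- parse_mdy_pivot50(c("4/30/50", "1/1/17"))
format(parsed)
```

A manual pivot like this is useful here because base R's `%y` conversion uses a different cut-off (00 to 68 become 20xx), which would place 1950 in the wrong century.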

Answer 2

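Here is another attempt. In my first step, I created two new columns (year and month). Then I filtered the data to the period from March to May. Next, I created index numbers for the rows where the temperature is above 0 °C. This is done separately for each year. Since five consecutive days are required before a day with above-zero temperature, index numbers less than or equal to 5 need to be ignored. This is done by the inner if_else(), which sits in the true branch of the outer if_else().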

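In my second step, I chose to use a package called SOfun, developed by the author of splitstackshape. You can download this package from GitHub. What getMyRows() does is: 1) it identifies which rows should be considered, based on the pattern argument, 2) it takes a specified range of rows around each row flagged in 1), and 3) it returns the results as a list. Here range = -5:0 means that I select the five rows before the target row as well as the target row itself.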

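In my third step, I subset mylist with two logical conditions. !is.na(x$ind[6]) checks that the 6th element of ind is not NA, and all(x$temp[1:5] < 0) checks that elements 1 to 5 of temp are all below zero. Filter() keeps the list elements that satisfy both conditions. Then I extract the 6th row from each data frame, since that is the target row. I bind the list, group the data by year, and use slice() to select the first observation for each year.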

library(devtools)

install_github("mrdwab/overflow-mrdwab")
install_github("mrdwab/SOfun")

library(overflow)
library(SOfun)
library(readxl)
library(dplyr)

# Part 1
mydf <- read_excel("temp.xlsx") %>%
        mutate(year = as.numeric(format(Date, "%Y")),
               month = as.numeric(format(Date, "%m"))) %>%
        filter(between(month, 3, 5)) %>%
        group_by(year) %>%
        mutate(ind = if_else(temp > 0, 
                     {ind <- row_number()
                      if_else(ind <= 5, NA_integer_, ind)},
                      NA_integer_)) %>%
        ungroup

# Part 2
mylist <- getMyRows(mydf,
                    pattern = which(complete.cases(mydf$ind)),
                    range = -5:0, isNumeric = TRUE)

# Part 3
Filter(function(x) !is.na(x$ind[6]) & all(x$temp[1:5] < 0), mylist) %>%
lapply(function(x) x[6, ]) %>%
bind_rows %>%
group_by(year) %>%
slice(1) %>%
select(Date)

    year Date               
   <dbl> <dttm>             
 1  1950 1950-04-30 00:00:00
 2  1951 1951-05-21 00:00:00
 3  1952 1952-05-28 00:00:00
 4  1953 1953-05-15 00:00:00
 5  1954 1954-05-28 00:00:00
 6  1955 1955-05-14 00:00:00
 7  1956 1956-05-02 00:00:00
 8  1957 1957-05-07 00:00:00
 9  1958 1958-04-27 00:00:00
10  1960 1960-04-24 00:00:00
11  1961 1961-05-16 00:00:00
12  1962 1962-05-19 00:00:00
13  1963 1963-05-13 00:00:00
14  1964 1964-05-20 00:00:00
15  1965 1965-05-20 00:00:00
16  1966 1966-05-07 00:00:00
17  1967 1967-04-27 00:00:00
18  1968 1968-05-10 00:00:00
19  1969 1969-05-22 00:00:00
20  1970 1970-05-21 00:00:00
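For readers who prefer not to install SOfun, the windowing idea behind getMyRows() with range = -5:0 can be sketched in base R. This is an illustrative stand-in on made-up temperatures for a single year, not the package's implementation.

```r
# Toy temperatures (made up) for one year, already restricted to March-May
temp <- c(-3, -2, -4, -1, -5, 2, 1, -1, -2, -3, -1, -2, 3)

# Candidate rows: above 0 and with at least five rows before them
targets <- which(temp > 0 & seq_along(temp) > 5)

# Keep a candidate only if the five preceding days are all below 0
ok <- vapply(targets,
             function(i) all(temp[(i - 5):(i - 1)] < 0),
             logical(1))

onset_idx <- targets[ok]
onset_idx
```

Taking `onset_idx[1]` corresponds to the `slice(1)` step above (first match in the year), while `tail(onset_idx, 1)` would give the last one.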

Answer 3

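In (1) we assume that the melt day must occur in March, April or May, whereas in (2) only the 5 below-zero days must fall in March, April or May, and the melt day itself may occur later, in June, say.

1) Define df2, which is df1 plus additional columns: month, year and code, where code is 0 if the date is not in March, April or May, and otherwise 1 if temp < 0 and 2 if temp >= 0.

Now, using df2, run rollapplyr on code, returning TRUE if the last 6 dates have codes 1, 1, 1, 1, 1, 2 and FALSE otherwise. Take the TRUE rows and keep only the last one in each year. Right join that to a data frame of all years so that any year without a match shows up as NA in the output.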

library(zoo)

df2 <- df1 %>%
    mutate(Date = as.Date(Date), month = as.numeric(format(Date, "%m")), 
           year = as.numeric(format(Date, "%Y")),
           code = (month %in% 3:5) * ((temp < 0) + 2 * (temp >= 0)),
           OK = rollapplyr(code, 6, identical, c(1, 1, 1, 1, 1, 2), fill = FALSE))

df2 %>%
       filter(OK) %>%
       filter(!duplicated(year, fromLast = TRUE)) %>%
       right_join(unique(df2["year"]), by = "year") %>%
       select(year, Date)

giving:

   year       Date
1  1950 1950-05-24
2  1951 1951-05-21
3  1952 1952-05-28
4  1953 1953-05-15
5  1954 1954-05-28
6  1955 1955-05-14
7  1956 1956-05-27
8  1957 1957-05-17
9  1958 1958-05-21
10 1959       <NA>
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-05-13
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
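The coding and the rolling six-day check in (1) can be traced on a small made-up example. The sketch below uses base R `embed()` as a stand-in for `zoo::rollapplyr()` so that it runs without extra packages; the months and temperatures are invented for illustration.

```r
# Toy data (made up): one day in February, the rest in March
month <- c(2, 3, 3, 3, 3, 3, 3, 3)
temp  <- c(-5, -4, -3, -2, -1, -6, 1, 2)

# Same coding as in df2: 0 outside Mar-May, 1 if below zero, 2 otherwise
code <- (month %in% 3:5) * ((temp < 0) + 2 * (temp >= 0))

# embed() returns each window newest-first, so reverse the columns
pattern <- c(1, 1, 1, 1, 1, 2)
windows <- embed(code, 6)[, 6:1, drop = FALSE]
hit <- apply(windows, 1, function(w) all(w == pattern))

# Right-align like rollapplyr(..., fill = FALSE): the first 5 rows have no full window
OK <- c(rep(FALSE, 5), hit)
which(OK)
```

Only the row where five coded 1s are immediately followed by a 2 is flagged; the second above-zero day is not, because its window no longer starts with five 1s.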

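2) In (1) we assumed that the melt onset day must be in March, April or May. Here we assume that only the below-zero days must fall in that range, and that the melt onset day may fall after it.

The computation is the same as in (1) except that code is now defined as follows: 1 means a below-zero temperature in March, April or May, 2 means a temperature of zero or above at any time (not just in March, April and May), and 0 means anything else. We collapse code into a character string (one character per date) and use a regular expression on it to find substrings made of five 1s followed by anything up to the next 2. We process the rest as in (1), except that we no longer need the join, since there will always be a melt onset day. Without the join, the whole computation can be expressed as a single pipeline.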

df1 %>%
    mutate(Date = as.Date(Date), month = as.numeric(format(Date, "%m")), 
           year = as.numeric(format(Date, "%Y")),
           code = (month %in% 3:5) * (temp < 0) + 2 * (temp >= 0),
           OK = { g <- gregexpr("1{5}.*?2", paste(code, collapse = ""))[[1]]
                  seq_along(code) %in% (g + attr(g, "match.length") - 1) }) %>%
    filter(OK) %>%
    filter(!duplicated(year, fromLast = TRUE)) %>%
    select(year, Date)

giving:

   year       Date
1  1950 1950-05-24
2  1951 1951-06-01
3  1952 1952-05-28
4  1953 1953-05-15
5  1954 1954-05-28
6  1955 1955-05-14
7  1956 1956-05-27
8  1957 1957-05-17
9  1958 1958-05-21
10 1959 1959-06-02
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-06-01
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
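To see how the regular expression in (2) locates the onset row, here is a small sketch on made-up data in which the five below-zero days fall in May and the first day at or above zero falls in June. The values are illustrative assumptions only.

```r
# Toy data (made up): five cold days in May, then June
month <- c(5, 5, 5, 5, 5, 6, 6, 6)
temp  <- c(-1, -2, -3, -1, -2, -1, -3, 4)

# Coding from (2): 1 = below zero in Mar-May, 2 = zero or above at any time, 0 = other
code <- (month %in% 3:5) * (temp < 0) + 2 * (temp >= 0)

# One character per day
s <- paste(code, collapse = "")

# Five 1s, then as little as possible, then the next 2
g <- gregexpr("1{5}.*?2", s)[[1]]

# The last position of each match is the onset row
end <- as.integer(g + attr(g, "match.length") - 1)
OK <- seq_along(code) %in% end
which(OK)
```

The cold June days are coded 0, so the lazy `.*?` skips over them until it reaches the first 2, which is why the onset can land outside March to May in this variant.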

Answer 4

A straightforward solution in the tidyverse.

library(tidyverse)
library(lubridate)


readxl::read_excel("temp.xlsx") -> df1

df1 %>%
  mutate(year = year(Date),
         month = month(Date)) %>%
  group_by(year) %>%
  mutate(
    below_0 = as.numeric(temp < 0),
    streak5 = cumsum(below_0) - cumsum(lag(below_0, 5, 0)),
    onset = month %in% c(3, 4, 5) & lag(streak5) == 5 & below_0 == 0) %>% 
  filter(onset) %>%
  summarise(Date = last(Date))

giving:

# A tibble: 20 x 2
    year       Date
   <dbl>     <dttm>
 1  1950 1950-05-24
 2  1951 1951-05-21
 3  1952 1952-05-28
 4  1953 1953-05-15
 5  1954 1954-05-28
 6  1955 1955-05-14
 7  1956 1956-05-27
 8  1957 1957-05-17
 9  1958 1958-05-21
10  1960 1960-05-26
11  1961 1961-05-16
12  1962 1962-05-19
13  1963 1963-05-13
14  1964 1964-05-27
15  1965 1965-05-20
16  1966 1966-05-26
17  1967 1967-05-26
18  1968 1968-05-27
19  1969 1969-05-30
20  1970 1970-05-21

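I hope the code more or less explains itself: streak5 counts the below-zero days in the five-day window ending on the current day (so lag(streak5) == 5 means the previous five days were all below 0), onset implements the criterion given in the question, and summarise picks the last such date in each year.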
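The `streak5` arithmetic can be checked by hand on a short made-up vector. The sketch below uses a small hypothetical helper, `lag_n()`, as a base R stand-in for `dplyr::lag()` so that it runs without extra packages; the month condition is left out to keep the example short.

```r
# Toy indicator (made up): 1 = below 0 on that day, 0 = not below 0
below_0 <- c(1, 1, 1, 1, 1, 1, 0, 1, 0)

# Hypothetical stand-in for dplyr::lag(x, n, default)
lag_n <- function(x, n, default = NA) c(rep(default, n), head(x, -n))

# Below-zero days in the 5-day window ending on each day:
# running total minus the running total from 5 days earlier
streak5 <- cumsum(below_0) - cumsum(lag_n(below_0, 5, 0))

# Onset: the previous day closed a full 5-day cold streak and today is not below 0
onset <- lag_n(streak5, 1) == 5 & below_0 == 0
which(onset)
```

Day 9 is also not below 0, but the window ending on day 8 contains a warm day, so only day 7 qualifies.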
