Return closest date to a given date in R
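My data frame consists of individual observations of individual animals. Each animal has a birth date, which I would like to associate with the closest field season date from a vector of dates.
Here is a very basic reproducible example: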
ID <- c("a", "b", "c", "d", "a") # individual "a" is measured twice here
birthdate <- as.Date(c("2012-06-12", "2014-06-14", "2015-11-11", "2016-09-30", "2012-06-12"))
df <- data.frame(ID, birthdate)
# This is the date vector
season_enddates <- as.Date(c("2011-11-10", "2012-11-28", "2013-11-29", "2014-11-26", "2015-11-16", "2016-11-22", "2012-06-21", "2013-06-23", "2014-06-25", "2015-06-08", "2016-06-14"))
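With the following code, I can find which season end date is closest to each birth date, but only as its position in season_enddates:
for(i in 1:length(df$birthdate)){
df$birthseason[i] <- which(abs(season_enddates-df$birthdate[i]) == min(abs(season_enddates-df$birthdate[i])))
}
However, what I want is the actual date, not its position. For example, the first value of birthseason should be 2012-06-21.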
This is a bit confusing, because you use variables that are not included in your example.
But I think this is what you want:
for (ii in seq_len(nrow(df))) df$birthseason[ii] <- as.character(season_enddates[which.min(abs(df$birthdate[ii] - season_enddates))])
Or using lapply:
df$birthseason <- unlist(lapply(df$birthdate,function(x) as.character(season_enddates[which.min(abs(x - season_enddates))])))
Result (note that birthseason is stored as character here, not as Date):
> df
ID birthdate birthseason
1 a 2012-06-12 2012-06-21
2 b 2014-06-14 2014-06-25
3 c 2015-11-11 2015-11-16
4 d 2016-09-30 2016-11-22
5 a 2012-06-12 2012-06-21
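If you would rather avoid the explicit loop and keep the Date class, one possible variant is to build the full matrix of gaps with outer and pick the smallest gap per row. This is only a sketch on the question's data; the names gap, nearest_idx and birthseason are mine, and the matrix has length(birthdate) x length(season_enddates) cells, so it is meant for modest sizes.

```r
# Data from the question
birthdate <- as.Date(c("2012-06-12", "2014-06-14", "2015-11-11",
                       "2016-09-30", "2012-06-12"))
season_enddates <- as.Date(c("2011-11-10", "2012-11-28", "2013-11-29",
                             "2014-11-26", "2015-11-16", "2016-11-22",
                             "2012-06-21", "2013-06-23", "2014-06-25",
                             "2015-06-08", "2016-06-14"))

# Rows = birth dates, columns = season end dates, cells = absolute gap in days
gap <- abs(outer(as.numeric(birthdate), as.numeric(season_enddates), "-"))

# Column index of the smallest gap in each row
nearest_idx <- apply(gap, 1, which.min)

# Indexing the Date vector keeps the Date class (no as.character needed)
birthseason <- season_enddates[nearest_idx]
birthseason
# [1] "2012-06-21" "2014-06-25" "2015-11-16" "2016-11-22" "2012-06-21"
```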
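I would suggest some changes to your question so that your example code generates all the variables needed to reproduce your problem. Please check whether I have understood your question.
To solve it, I suggest using which.min (which makes your code simpler and faster) to subset your season_enddates vector, as follows: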
# Create the column as Date first so the assigned values keep their class
df$birthseasonDate <- as.Date(NA)
for(i in seq_along(df$birthdate)){
df$birthseasonDate[i] <- season_enddates[which.min(abs(season_enddates - df$birthdate[i]))]
}
That is, you are looking for the season_enddate that is closest to birthdate[1], then to birthdate[2], and so on.
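To make the which.min step concrete, here is a small sketch for a single birth date, using a shortened season vector that I made up for illustration. It also shows one difference from the which(... == min(...)) form in the question: on a tie, which.min returns only the first position, while which can return several.

```r
# Shortened, illustrative season vector (not the full one from the question)
season_enddates <- as.Date(c("2012-06-21", "2012-11-28", "2013-06-23"))
one_birthdate <- as.Date("2012-06-12")

gaps <- abs(season_enddates - one_birthdate)  # gaps in days: 9, 169, 376
idx <- which.min(gaps)                        # position of the smallest gap
season_enddates[idx]                          # the date itself: "2012-06-21"

# Tie: the birth date is exactly 5 days from both season dates
tie_seasons <- as.Date(c("2012-06-01", "2012-06-11"))
tie_gaps <- abs(tie_seasons - as.Date("2012-06-06"))
tie_idx <- which.min(tie_gaps)                # 1: only the first minimum
all_ties <- which(tie_gaps == min(tie_gaps))  # 1 2: every minimum
```

Inside a loop that assigns to a single element, the second form would try to put two values into one slot, so which.min is the safer choice there.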
To get straight to the data, I will create an actual reproducible example:
birthdate <- as.Date(c("2012-06-12", "2014-06-14",
"2015-11-11", "2016-09-30",
"2012-06-12"))
season_enddates <- as.Date(c("2011-11-10", "2012-11-28",
"2013-11-29", "2014-11-26",
"2015-11-16", "2016-11-22",
"2012-06-21", "2013-06-23",
"2014-06-25", "2015-06-08",
"2016-06-14"))
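Basically I used the same functions you used, except that I broke the computation into steps so it is easier to follow what you are trying to do: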
new.vector <- rep(0, length(birthdate))
for(i in 1:length(birthdate)){
diffs <- abs(birthdate[i] - season_enddates)
inds <- which.min(diffs)
new.vector[i] <- season_enddates[inds]
}
# new.vector now contains some dates that have been converted to numbers:
as.Date(new.vector, origin = "1970-01-01")
# [1] "2012-06-21" "2014-06-25" "2015-11-16" "2016-11-22"
# [5] "2012-06-21"
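The conversion to numbers happens because new.vector starts out as a numeric vector (rep(0, ...)), so each assigned Date is stored as its underlying day count. A minimal sketch of one way around that, assuming you are happy to preallocate a Date vector of NAs instead (the name nearest is mine):

```r
birthdate <- as.Date(c("2012-06-12", "2014-06-14", "2015-11-11",
                       "2016-09-30", "2012-06-12"))
season_enddates <- as.Date(c("2011-11-10", "2012-11-28", "2013-11-29",
                             "2014-11-26", "2015-11-16", "2016-11-22",
                             "2012-06-21", "2013-06-23", "2014-06-25",
                             "2015-06-08", "2016-06-14"))

# Preallocate a Date vector, so element-wise assignment keeps the class
nearest <- rep(as.Date(NA), length(birthdate))
for (i in seq_along(birthdate)) {
  diffs <- abs(birthdate[i] - season_enddates)
  nearest[i] <- season_enddates[which.min(diffs)]
}

class(nearest)
# [1] "Date"
nearest
# [1] "2012-06-21" "2014-06-25" "2015-11-16" "2016-11-22" "2012-06-21"
```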
All the solutions here are essentially the same. If you want an optimized function that does this for you, this is how I would do it:
match_season <- function(x, y){
  nx <- length(x)
  ind <- numeric(nx)
  for(i in seq_len(nx)){
    ind[i] <- which.min(abs(x[i] - y))
  }
  y[ind]
}
Then you can simply do:
df$birthseason <- match_season(df$birthdate, season_enddates)
This looks cleaner and gives you the desired output in the correct Date format.
Benchmark:
start <- as.Date("1990-07-01")
end <- as.Date("2017-06-30")
birthdate <- sample(seq(start, end, by = "1 day"), 1000)
season_enddates <- seq(as.Date("1990-12-21"),
as.Date("2017-6-21"),
by = "3 months")
library(rbenchmark)
benchmark(match_season(birthdate, season_enddates),
columns = c("test","elapsed"))
This gives an elapsed time of 7.62 seconds for 100 replications.
findInterval is useful in this case. First, for each df$birthdate, locate the interval of the sorted season_enddates that it falls into:
vec = sort(season_enddates)
int = findInterval(df$birthdate, vec, all.inside = TRUE)
int
#[1] 1 5 8 10 1
Then we compare the distance to each of the two dates bounding that interval and select the closer one:
ans = vec[int]
i = abs(df$birthdate - vec[int]) > abs(df$birthdate - vec[int + 1])
ans[i] = vec[int[i] + 1]
ans
#[1] "2012-06-21" "2014-06-25" "2015-11-16" "2016-11-22" "2012-06-21"
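For reuse, the two steps above could be wrapped in a small function. This is a sketch, not a tested library routine: the name nearest_date is mine, it assumes y holds at least two non-missing dates, and on an exact tie it keeps the earlier date.

```r
nearest_date <- function(x, y) {
  vec <- sort(y)
  # Index of the sorted date at or before each x, clamped to 1..(length(vec) - 1)
  int <- findInterval(as.numeric(x), as.numeric(vec), all.inside = TRUE)
  lower <- vec[int]
  upper <- vec[int + 1]
  # Switch to the upper neighbour only when it is strictly closer
  use_upper <- abs(as.numeric(x - lower)) > abs(as.numeric(x - upper))
  ans <- lower
  ans[use_upper] <- upper[use_upper]
  ans
}

birthdate <- as.Date(c("2012-06-12", "2014-06-14", "2015-11-11",
                       "2016-09-30", "2012-06-12"))
season_enddates <- as.Date(c("2011-11-10", "2012-11-28", "2013-11-29",
                             "2014-11-26", "2015-11-16", "2016-11-22",
                             "2012-06-21", "2013-06-23", "2014-06-25",
                             "2015-06-08", "2016-06-14"))

nearest_date(birthdate, season_enddates)
# [1] "2012-06-21" "2014-06-25" "2015-11-16" "2016-11-22" "2012-06-21"
```

Because findInterval does a binary search on the sorted vector, this avoids computing every pairwise gap, which should matter more as the number of birth dates grows.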