Taking the minimum difference of dates between two data frames for the same id
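My question is simple. I have two data frames, each with a date column (%Y-%m-%d) and an ID column. One has a single row per ID; the other has multiple rows for the same ID. For each ID, I want the row whose date is closest to the date in the first data frame. An example explains it better:
df1 (a single row per value of colA):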
+-------+------------+------+------+-------+-------+
| colA | colB | colC | colD | colE | colF |
+-------+------------+------+------+-------+-------+
| 3000 | 2011-01-20 | 2 | 3.43 | 2.01 | 1.63 |
| 3001 | 2012-04-06 | 1 | 1.12 | -0.63 | -1.16 |
| 3002 | 2012-04-24 | 2 | 2.28 | -0.18 | -0.12 |
| 3003 | 2012-04-13 | 2 | 1.27 | -0.51 | -0.82 |
| 3004 | 2011-08-24 | 5 | 5.30 | 2.68 | 2.10 |
| 3006 | 2011-08-02 | 2 | 2.12 | -0.27 | -2.60 |
+-------+------------+------+------+-------+-------+
df2 (multiple rows per value of the first column, colX):
+------+---------------+----------+
| colX | colY | colZ |
+------+---------------+----------+
| 3000 | 2011-02-01 | 0 |
| 3000 | 2012-03-01 | 0 |
| 3000 | 2013-02-01 | 0 |
| 3000 | 2014-03-01 | 1 |
| 3000 | 2015-03-01 | 0 |
| 3000 | 2016-04-01 | 0 |
| 3002 | 2011-03-01 | 1 |
| 3002 | 2011-08-01 | 1 |
| 3002 | 2012-04-01 | 0 |
+------+---------------+----------+
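In this case, I look at the first value in colA (df1) and compute the difference in months between 2011-01-20 and every date for 3000 in df2 (2011-02-01, 2012-03-01, etc.), so the first 6 rows. I keep only the minimum difference, which in this case is the first one (2011-02-01), a difference of almost one month. So in the end df1 should have 3 new columns (Y, Z and the difference): the closest date from df2, the 0/1 value of Z, and the difference in days between the two dates.
For example, for 3000 (I take the absolute value of the difference):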
3000 2011-01-20 2 3.43 2.01 1.63 2011-02-01 0 12
What function should I use? apply? ddply?
Thanks in advance.
You can try this (be careful about how you define the date operation, because that is not clear in your question):
library(tidyverse)
library(lubridate)
#Data
df1 <- structure(list(colA = c(3000L, 3001L, 3002L, 3003L, 3004L, 3006L
), colB = c("2011-01-20", "2012-04-06", "2012-04-24", "2012-04-13",
"2011-08-24", "2011-08-02"), colC = c(2L, 1L, 2L, 2L, 5L, 2L),
colD = c(3.43, 1.12, 2.28, 1.27, 5.3, 2.12), colE = c(2.01,
-0.63, -0.18, -0.51, 2.68, -0.27), colF = c(1.63, -1.16,
-0.12, -0.82, 2.1, -2.6)), class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(colX = c(3000L, 3000L, 3000L, 3000L, 3000L, 3000L,
3002L, 3002L, 3002L), colY = c("2011-02-01", "2012-03-01", "2013-02-01",
"2014-03-01", "2015-03-01", "2016-04-01", "2011-03-01", "2011-08-01",
"2012-04-01"), colZ = c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-9L))
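#Code
#Compute: attach df1 to every df2 row, then keep the rows with the smallest month difference per id
dfo <- df2 %>%
  rename(colA = colX) %>%
  left_join(df1, by = "colA") %>%
  mutate(
    #Absolute difference in calendar months
    Diff = abs(12 * (year(as.Date(colB)) - year(as.Date(colY))) +
                 month(as.Date(colB)) - month(as.Date(colY))),
    #Absolute difference in days
    Diffdays = abs(as.Date(colB) - as.Date(colY))
  ) %>%
  group_by(colA) %>%
  filter(Diff == min(Diff))
#Format: df1 columns first, then colY, colZ and the two differences
vars <- c(names(df1), names(df2)[-1], 'Diff', 'Diffdays')
#Result
dfo %>% select(all_of(vars))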
# A tibble: 2 x 10
# Groups: colA [2]
colA colB colC colD colE colF colY colZ Diff Diffdays
<int> <chr> <int> <dbl> <dbl> <dbl> <chr> <int> <dbl> <drtn>
1 3000 2011-01-20 2 3.43 2.01 1.63 2011-02-01 0 1 12 days
2 3002 2012-04-24 2 2.28 -0.18 -0.12 2012-04-01 0 0 23 days
Please check whether this does what you need.
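If the closest date should be chosen by days rather than by calendar months, and every row of df1 should be kept even when its id has no match in df2, a base R variant along the following lines may be closer to what the question describes. This is only a sketch under those two assumptions; the trimmed-down `df1`/`df2` and the names `m` and `nearest` are made up for the illustration.

```r
# Reduced copies of the example data (only the columns needed here)
df1 <- data.frame(
  colA = c(3000L, 3001L, 3002L),
  colB = c("2011-01-20", "2012-04-06", "2012-04-24")
)
df2 <- data.frame(
  colX = c(3000L, 3000L, 3002L, 3002L, 3002L),
  colY = c("2011-02-01", "2012-03-01", "2011-03-01", "2011-08-01", "2012-04-01"),
  colZ = c(0L, 0L, 1L, 1L, 0L)
)

# Left join: all.x = TRUE keeps ids from df1 that never appear in df2
m <- merge(df1, df2, by.x = "colA", by.y = "colX", all.x = TRUE)

# Absolute difference in days (NA when there is no matching df2 row)
m$Diffdays <- abs(as.numeric(as.Date(m$colY) - as.Date(m$colB)))

# Sort so the smallest difference comes first within each id (NA sorts last),
# then keep the first row per id
m <- m[order(m$colA, m$Diffdays), ]
nearest <- m[!duplicated(m$colA), ]
nearest
```

With this approach id 3001 stays in the result with `NA` in colY, colZ and Diffdays; if two df2 dates are equally close, the one that sorts first is kept.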