加入两个间隔不当的数据帧?
Joining two data frames with intervals misbehaves?
编辑(2019-06):此问题不再存在,因为 this issue 已关闭并实现了相关功能。如果您现在 运行 包含更新包的代码,它将起作用。
我试图找到重叠的间隔,并决定将间隔数据与 dplyr::left_join()
连接起来,这样我就可以将具有 lubridate::int_overlaps()
的间隔与具有相同 ID 的每个其他间隔进行比较。
这是我期望 left_join()
的行为方式。三行的两个小标题交叉形成 9 行的 tibble:
library(tidyverse)
tibble(a = rep("a", 3), b = rep(1, 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(2, 3)))
Joining, by = "a"
# A tibble: 9 x 3
a b c
<chr> <dbl> <dbl>
1 a 1 2
2 a 1 2
3 a 1 2
4 a 1 2
5 a 1 2
6 a 1 2
7 a 1 2
8 a 1 2
9 a 1 2
下面是相同代码在不同时间间隔内的行为方式。我有九行,但行不像上面那样交叉:
tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
Joining, by = "a"
# A tibble: 9 x 3
a b c
<chr> <S4: Interval> <S4: Interval>
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a NA--NA NA--NA
5 a NA--NA NA--NA
6 a NA--NA NA--NA
7 a NA--NA NA--NA
8 a NA--NA NA--NA
9 a NA--NA NA--NA
我认为这是出乎意料的,但我可能遗漏了什么?或者这是一个错误?
看起来像是 tibble()
中的错误:
> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
Error in round_x - lhs :
Arithmetic operators undefined for 'Interval' and 'Interval' classes:
convert one to numeric or a matching time-span class.
但是:
> AA <- as.data.frame(AA)
class(AA$b)
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
a b
1 a 2001-01-01 UTC--2002-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC
因此,这有效:
> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> BB <- tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003)))
> AA %>% as.data.frame %>% left_join(BB)
Joining, by = "a"
a b c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
虽然这不是:
> AA %>% left_join(BB)
Joining, by = "a"
Error in round_x - lhs :
Arithmetic operators undefined for 'Interval' and 'Interval' classes:
convert one to numeric or a matching time-span class.
注意:我在 x86_64-pc-linux-gnu
的 R 3.4.3 上使用 tibble_1.4.1(与您相同版本的 lubridate 和 dplyr)
错误
该对象还包含相关信息:
res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
print.data.frame(res)
# a b c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
res$c
# [1] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [5] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [9] 2002-01-01 UTC--2003-01-01 UTC
但是当按索引进行子集化时,它不再起作用了:
res_df <- as.data.frame(res)
head(res_df)
a b c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a NA--NA NA--NA
5 a NA--NA NA--NA
6 a NA--NA NA--NA
res_df[4,"c"]
[1] NA--NA
和tibble:::print.tbl
利用了head
。这就是为什么使用 tibbles
而不是 data.frames
.
时问题会立即可见的原因
键入 str(res$b)
我们看到我们只有 3 start
个值对应 9 个 data
个值。
如果我们这样做:
res_df$b@start <- rep(res_df$b@start,3)
res_df$c@start <- rep(res_df$c@start,3)
现在打印一切正常:
a b c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
解决方案
我们已经看到 as.data.frame
是不够的,left_join
是把事情搞砸的函数,请改用 merge
:
res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
merge(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))),
all.x=TRUE)
head(res)
# a b c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
res[4,"c"]
#[1] 2002-01-01 UTC--2003-01-01 UTC
我已经报告了这个问题here
此问题已不存在,因为 this issue 已关闭并实现了相关功能。如果您现在 运行 包含更新包的代码,它将起作用。
library(lubridate)
library(tidyverse)
tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
#> Joining, by = "a"
#> # A tibble: 9 x 3
#> a b c
#> <chr> <Interval> <Interval>
#> 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
由 reprex package (v0.3.0)
于 2019-06-07 创建
编辑(2019-06):此问题不再存在,因为 this issue 已关闭并实现了相关功能。如果您现在 运行 包含更新包的代码,它将起作用。
我试图找到重叠的间隔,并决定将间隔数据与 dplyr::left_join()
连接起来,这样我就可以将具有 lubridate::int_overlaps()
的间隔与具有相同 ID 的每个其他间隔进行比较。
这是我期望 left_join()
的行为方式。三行的两个小标题交叉形成 9 行的 tibble:
library(tidyverse)
tibble(a = rep("a", 3), b = rep(1, 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(2, 3)))
Joining, by = "a"
# A tibble: 9 x 3
a b c
<chr> <dbl> <dbl>
1 a 1 2
2 a 1 2
3 a 1 2
4 a 1 2
5 a 1 2
6 a 1 2
7 a 1 2
8 a 1 2
9 a 1 2
下面是相同代码在不同时间间隔内的行为方式。我有九行,但行不像上面那样交叉:
tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
Joining, by = "a"
# A tibble: 9 x 3
a b c
<chr> <S4: Interval> <S4: Interval>
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a NA--NA NA--NA
5 a NA--NA NA--NA
6 a NA--NA NA--NA
7 a NA--NA NA--NA
8 a NA--NA NA--NA
9 a NA--NA NA--NA
我认为这是出乎意料的,但我可能遗漏了什么?或者这是一个错误?
看起来像是 tibble()
中的错误:
> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
Error in round_x - lhs :
Arithmetic operators undefined for 'Interval' and 'Interval' classes:
convert one to numeric or a matching time-span class.
但是:
> AA <- as.data.frame(AA)
class(AA$b)
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
a b
1 a 2001-01-01 UTC--2002-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC
因此,这有效:
> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> BB <- tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003)))
> AA %>% as.data.frame %>% left_join(BB)
Joining, by = "a"
a b c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
虽然这不是:
> AA %>% left_join(BB)
Joining, by = "a"
Error in round_x - lhs :
Arithmetic operators undefined for 'Interval' and 'Interval' classes:
convert one to numeric or a matching time-span class.
注意:我在 x86_64-pc-linux-gnu
的 R 3.4.3 上使用 tibble_1.4.1(与您相同版本的 lubridate 和 dplyr)错误
该对象还包含相关信息:
res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
print.data.frame(res)
# a b c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
res$c
# [1] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [5] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [9] 2002-01-01 UTC--2003-01-01 UTC
但是当按索引进行子集化时,它不再起作用了:
res_df <- as.data.frame(res)
head(res_df)
a b c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a NA--NA NA--NA
5 a NA--NA NA--NA
6 a NA--NA NA--NA
res_df[4,"c"]
[1] NA--NA
和tibble:::print.tbl
利用了head
。这就是为什么使用 tibbles
而不是 data.frames
.
键入 str(res$b)
我们看到我们只有 3 start
个值对应 9 个 data
个值。
如果我们这样做:
res_df$b@start <- rep(res_df$b@start,3)
res_df$c@start <- rep(res_df$c@start,3)
现在打印一切正常:
a b c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
解决方案
我们已经看到 as.data.frame
是不够的,left_join
是把事情搞砸的函数,请改用 merge
:
res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
merge(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))),
all.x=TRUE)
head(res)
# a b c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
res[4,"c"]
#[1] 2002-01-01 UTC--2003-01-01 UTC
我已经报告了这个问题here
此问题已不存在,因为 this issue 已关闭并实现了相关功能。如果您现在 运行 包含更新包的代码,它将起作用。
library(lubridate)
library(tidyverse)
tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
#> Joining, by = "a"
#> # A tibble: 9 x 3
#> a b c
#> <chr> <Interval> <Interval>
#> 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
由 reprex package (v0.3.0)
于 2019-06-07 创建