data.frame subsetting: very strange behavior using numeric date difference
There is a behavior I don't understand when subsetting a data frame by dates. Here is an example (data at the end):
> df
positif1 positif2 date1 date2
1 0 0 2020-05-02 2020-04-30
2 0 0 2020-05-02 2020-04-21
3 0 0 2020-05-02 2020-04-30
.
.
I naively tried to subset it like this:
df[df$positif2 == 0 & df$positif1 == 0 & as.numeric(df$date1 - df$date2) <=2,]
It returns nothing, even though there are clearly rows that satisfy the condition:
as.numeric(df[df$positif2 == 0 & df$positif1 == 0,"date1"]-df[df$positif2 == 0 & df$positif1 == 0,"date2"])
[1] 2 11 2 29 12 18 1 22 5 24 5 6 4 25 9 9 13 16 17 35 5 35 22 51 3 17 8 16 12 15 14 21 14
[34] 4
I realized that the correct way is:
df[df$positif2 == 0 & df$positif1 == 0 & difftime(df$date1, df$date2,units = "day") <=2,]
So my question is about the units, which change when the data frame is subset:
> df$date1 - df$date2
Time differences in secs
[1] 172800 950400 172800 2505600 1036800 1555200 2073600 86400 1900800 432000 2073600 432000
[13] 518400 345600 2160000 777600 777600 0 1123200 1382400 1468800 3024000 0 432000
[25] 3024000 1900800 4406400 0 259200 1468800 691200 1382400 1036800 1296000 1209600 1814400
[37] 1209600 345600
> df[df$positif2 == 0,"date1"] - df[df$positif2 == 0,"date2"]
Time differences in days
[1] 2 11 2 29 12 18 24 1 22 5 24 5 6 4 25 9 9 13 16 17 35 5 35 22 51 3 17 8 16 12 15 14 21
[34] 14 4
This makes no sense to me. Am I doing something wrong? Is there a reason for this behavior?
Data:
df <- structure(list(positif1 = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), positif2 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), date1 = structure(c(1588377600,
1588377600, 1588377600, 1588377600, 1588377600, 1588377600, 1588377600,
1588377600, 1588377600, 1588377600, 1588377600, 1588377600, 1588377600,
1588377600, 1588377600, 1588377600, 1588377600, 1584057600, 1588377600,
1588377600, 1588377600, 1588377600, 1586563200, 1588377600, 1588377600,
1588377600, 1588377600, 1586908800, 1588377600, 1588377600, 1588377600,
1588377600, 1588377600, 1588377600, 1588377600, 1588377600, 1588377600,
1588377600), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
date2 = structure(c(1588204800, 1587427200, 1588204800, 1585872000,
1587340800, 1586822400, 1586304000, 1588291200, 1586476800,
1587945600, 1586304000, 1587945600, 1587859200, 1588032000,
1586217600, 1587600000, 1587600000, 1584057600, 1587254400,
1586995200, 1586908800, 1585353600, 1586563200, 1587945600,
1585353600, 1586476800, 1583971200, 1586908800, 1588118400,
1586908800, 1587686400, 1586995200, 1587340800, 1587081600,
1587168000, 1586563200, 1587168000, 1588032000), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -38L), class = "data.frame", index = structure(integer(0), "`__positif1__positif2`" = c(1L,
2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 19L, 20L, 21L, 22L, 24L, 25L, 26L, 27L, 29L, 30L, 31L, 32L,
33L, 34L, 35L, 36L, 37L, 38L, 7L, 18L, 23L, 28L)))
From ?difftime:
Subtraction of date-time objects gives an object of this class, by calling difftime with units = "auto".
So we know that subtracting date-times uses units = "auto" by default.
Secondly, the same help page says:
If units = "auto", a suitable set of units is chosen, the largest possible (excluding "weeks") in which all the absolute differences are greater than one.
So with units = "auto", R tries to select the largest suitable unit. When we do
df$date1 - df$date2
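some entries in date1 and date2 are identical, which makes their difference 0, so "secs" is chosen as the unit here. as.numeric() then returns seconds, and only those zero differences pass the <= 2 test. In this data they all fall on rows where positif1 is 1, which is why the original subset comes back empty.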
But when you subset the dates to only the rows where positif2 == 0, the smallest difference is one day, so the unit selected here is "days":
df$date1[df$positif2 == 0] - df$date2[df$positif2 == 0]
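The unit switch can be reproduced on a much smaller input. The sketch below uses made-up vectors (t1, t2 and keep are illustrative names, not the df from the question) and only shows that one identical pair is enough to push the automatically chosen unit down to seconds.

```r
# Toy data, assumed for illustration only (not the question's df).
t1 <- as.POSIXct(c("2020-05-02", "2020-05-02", "2020-03-13"), tz = "UTC")
t2 <- as.POSIXct(c("2020-04-30", "2020-04-21", "2020-03-13"), tz = "UTC")

# The third pair is identical, so the smallest absolute difference is 0
# and units = "auto" falls back to seconds.
d_all <- t1 - t2
units(d_all)

# Dropping the identical pair leaves a smallest difference of 2 days,
# so the same subtraction now reports days.
keep <- c(TRUE, TRUE, FALSE)
d_sub <- t1[keep] - t2[keep]
units(d_sub)
```

Calling units() on the result before converting it with as.numeric() is a cheap way to check which scale the numbers are on.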
The correct approach, as you have already figured out, is to use difftime and specify the units argument explicitly.
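As a minimal sketch of that approach, the example below pins the unit to days before comparing, so the result no longer depends on which rows are present. The data frame toy and its columns flag, t1 and t2 are hypothetical stand-ins for the question's df.

```r
# Toy data frame, assumed for illustration only.
toy <- data.frame(
  flag = c(0, 0, 1),
  t1 = as.POSIXct(c("2020-05-02", "2020-05-02", "2020-03-13"), tz = "UTC"),
  t2 = as.POSIXct(c("2020-04-30", "2020-04-21", "2020-03-13"), tz = "UTC")
)

# Fix the unit explicitly, then convert to a plain number.
gap_days <- as.numeric(difftime(toy$t1, toy$t2, units = "days"))

# An equivalent form: as.numeric() on a difftime also accepts units.
gap_days2 <- as.numeric(toy$t1 - toy$t2, units = "days")

# The filter now means "at most 2 days" regardless of the other rows.
res <- toy[toy$flag == 0 & gap_days <= 2, ]
res
```

Either form should give the same numbers; the second is handy when the subtraction has already been written and only the conversion needs fixing.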