Number of overlapping datetimes inside the same table (R)
I have a table of about 50,000 rows with four columns.
ID Arrival Departure Gender
1 10/04/2015 23:14 11/04/2015 00:21 F
1 11/04/2015 07:59 11/04/2015 08:08 F
3 10/04/2017 21:53 30/03/2017 23:37 M
3 31/03/2017 07:09 31/03/2017 07:57 M
3 01/04/2017 01:32 01/04/2017 01:35 M
3 01/04/2017 13:09 01/04/2017 14:23 M
6 10/04/2015 21:31 10/04/2015 23:17 F
6 10/04/2015 23:48 11/04/2015 00:05 F
6 01/04/2016 21:45 01/04/2016 22:48 F
6 02/04/2016 04:54 02/04/2016 07:38 F
6 04/04/2016 18:41 04/04/2016 22:48 F
10 10/04/2015 22:39 11/04/2015 00:42 M
10 13/04/2015 02:57 13/04/2015 03:07 M
10 31/03/2016 22:29 01/04/2016 08:39 M
10 01/04/2016 18:49 01/04/2016 19:44 M
10 01/04/2016 22:28 02/04/2016 00:31 M
10 05/04/2017 09:27 05/04/2017 09:28 M
10 06/04/2017 15:12 06/04/2017 15:43 M
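This is a very small sample of the table. What I want to know is how many other people were present at the same time as each entry, split by gender. For example, during the first visit of the person with ID 1, the person with ID 6 was present, and the person with ID 10 was present twice in the same time interval. That means there were 2 other overlaps at the same time. It also means the person with ID 1 overlapped with 1 male and 1 female.
So the result should look like this: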
ID Arrival Departure Males encountered Females encountered
1 10/04/2015 23:14 11/04/2015 00:21 1 1
How can I calculate this? I have tried working with foverlaps and have managed to solve this in Excel, but I would like to do it in R.
Here is one possibility. It uses lubridate's interval and the int_overlaps function, which finds overlapping date intervals. It has a drawback, though: Interval objects don't work with dplyr, so this version does all the work manually in a for loop.
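To show what int_overlaps does before it gets buried in the loop, here is a minimal sketch. The timestamps are taken from the question's sample rows, and the variable names (visit, others, overlap_flags) are just made up for illustration.

```r
library(lubridate)

# One visit to test: ID 1's first visit from the question
visit <- interval(ymd_hm("2015-04-10 23:14"), ymd_hm("2015-04-11 00:21"))

# Two other visits (both from ID 10), built as a single vector of intervals
starts <- ymd_hm(c("2015-04-10 22:39", "2015-04-13 02:57"))
ends   <- ymd_hm(c("2015-04-11 00:42", "2015-04-13 03:07"))
others <- interval(starts, ends)

# int_overlaps recycles the single interval against the whole vector,
# giving one TRUE/FALSE per element of `others`
overlap_flags <- int_overlaps(visit, others)
print(overlap_flags)
```

Checking one interval against a whole vector in a single call is what the loop does for each row.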
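It starts by making a random 1,000-row data set resembling yours: each person arrives at some point within a two-year period and departs one or two days later.
It takes about 24 seconds to run for 1,000 rows, so you can expect 50K to take a while! Each row is checked against every other row, so the run time grows roughly quadratically with the number of rows. The for loop prints the row number so you can see where it has got to.
Any questions about the code, let me know.
There must be a faster vectorised way, but intervals don't seem to play nicely with apply either. Someone else may come up with something quicker...
The code is below; the final output has one row per person, with the Male and Female overlap counts as extra columns.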
library(tidyverse)
library(lubridate)
# Sample data: random arrival dates between 2017 and 2019
x <- data.frame(
  ID = 1:1000,
  Arrival = sample(seq(as.Date('2017/01/01'), as.Date('2019/01/01'), by = "day"), 1000, replace = TRUE),
  # Random male/female, 50% probability. Stored as a factor so that complete()
  # in the loop can fill in a gender that has zero overlaps
  Gender = factor(ifelse(rbinom(1000, 1, 0.5), 'Male', 'Female'), levels = c('Male', 'Female'))
)
# Make departure one or two days after arrival
x$Departure <- x$Arrival + sample(1:2, 1000, replace = TRUE)
#Lubridate has a function for checking whether date intervals overlap
#https://lubridate.tidyverse.org/reference/interval.html
#So first, let's make the arrival and departure dates into intervals
x$interval <- interval(x$Arrival,x$Departure)
#Then for every person / row
#We want to know if their interval overlaps with the rest
#At the moment, dplyr doesn't play nice with interval
#https://github.com/tidyverse/dplyr/issues/3206
#So let's go through each row and do this manually
#Keep each person's result in list initially
gendercounts <- list()
#Check timing
t <- proc.time()
# Go through every row manually (sigh!)
for (i in seq_len(nrow(x))) {
  print(paste0("Row ", i))
  # Exclude self (don't want to check date overlap with myself)
  overlapcheck <- x[x$ID != x$ID[i], ]
  # Find which intervals this person overlaps with; all other intervals are checked in one call
  overlapcheck$overlaps <- int_overlaps(x$interval[i], overlapcheck$interval)
  # Eyeball check that it is finding the overlaps we want
  # View(overlapcheck[overlapcheck$overlaps, ])
  # Use dplyr to count the overlaps for males and females
  # Keep only rows where the overlap is TRUE
  # Drop the interval column first, as dplyr doesn't like it (not tidy!)
  gendercount <- overlapcheck %>%
    select(-interval) %>%
    filter(overlaps) %>%
    group_by(Gender) %>%
    summarise(count = n()) %>% # count of overlapping rows for each sex
    complete(Gender, fill = list(count = 0)) # needed to keep zero counts: summarise drops them otherwise
  # We want the count for each gender in its own column, so make it wide
  gendercount <- gendercount %>%
    spread(key = Gender, value = count)
  # Store for turning into a data frame shortly
  gendercounts[[length(gendercounts) + 1]] <- gendercount
}
# dplyr command: turn the list into a data frame
gendercounts <- bind_rows(gendercounts)
# End result: drop the interval column, order the columns
final <- cbind(x, gendercounts) %>%
  select(ID, Arrival, Departure, Gender, Male, Female)
# ~24 seconds per thousand rows
proc.time() - t
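On the "faster vectorised way": two intervals overlap exactly when each one starts no later than the other ends, so the check can be done with plain comparisons on the date-time columns, without Interval objects or dplyr inside the loop. The following is only a sketch under that assumption; count_people is a hypothetical helper name, the five rows are copied from the question's sample, and I have not timed it on 50K rows. Note that it counts distinct people rather than visits, which matches the "1 male, 1 female" expected in the question.

```r
# Five visits copied from the question's sample (converted to ISO format)
fmt <- "%Y-%m-%d %H:%M"
visits <- data.frame(
  ID = c(1, 6, 6, 10, 10),
  Arrival = as.POSIXct(c("2015-04-10 23:14", "2015-04-10 21:31", "2015-04-10 23:48",
                         "2015-04-10 22:39", "2015-04-13 02:57"), format = fmt, tz = "UTC"),
  Departure = as.POSIXct(c("2015-04-11 00:21", "2015-04-10 23:17", "2015-04-11 00:05",
                           "2015-04-11 00:42", "2015-04-13 03:07"), format = fmt, tz = "UTC"),
  Gender = c("F", "F", "F", "M", "M")
)

# Hypothetical helper: for each row, the number of distinct other people of the
# given gender whose visit overlaps that row's visit
count_people <- function(d, gender) {
  vapply(seq_len(nrow(d)), function(i) {
    hit <- d$Arrival <= d$Departure[i] &   # the other visit starts before this one ends
      d$Departure >= d$Arrival[i] &        # and ends after this one starts
      d$ID != d$ID[i] &                    # exclude the person themselves
      d$Gender == gender
    length(unique(d$ID[hit]))
  }, integer(1))
}

visits$Males <- count_people(visits, "M")
visits$Females <- count_people(visits, "F")
print(visits)
```

This is still one pass per row, so the work remains quadratic in the number of rows; it only removes the per-row overhead of the interval and dplyr steps.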
Here is a data.table solution using foverlaps.
First, notice that there's an error in your data:
ID Arrival Departure Gender
3 10/04/2017 21:53 30/03/2017 23:37 M
This user arrives about 11 days after actually leaving. For foverlaps to run, I need to remove such rows:
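library(data.table)
# df is the data frame from the question; Arrival and Departure are assumed
# to be date-times (e.g. POSIXct) already, not text
dt <- data.table(df)
dt <- dt[Departure > Arrival, ] # filter out the wrong cases
setkey(dt, "Arrival", "Departure") # prepare for foverlaps
dt2 <- copy(dt) # use a different dt; it inherits the key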
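The snippet above takes df as given. If the timestamps are still text in the question's dd/mm/yyyy hh:mm layout, a conversion along these lines could come first. This is a sketch: the three-row df is a stand-in for the real data, and the UTC time zone is my assumption, since the question does not state one.

```r
library(data.table)

# Stand-in for the questioner's data, with timestamps still stored as text
df <- data.frame(
  ID = c(1, 3, 3),
  Arrival = c("10/04/2015 23:14", "10/04/2017 21:53", "31/03/2017 07:09"),
  Departure = c("11/04/2015 00:21", "30/03/2017 23:37", "31/03/2017 07:57"),
  Gender = c("F", "M", "M")
)

dt <- as.data.table(df)
fmt <- "%d/%m/%Y %H:%M"  # day/month/year, as in the question
dt[, `:=`(Arrival = as.POSIXct(Arrival, format = fmt, tz = "UTC"),
          Departure = as.POSIXct(Departure, format = fmt, tz = "UTC"))]

bad <- dt[Departure <= Arrival]  # rows the filter will drop, worth inspecting first
dt <- dt[Departure > Arrival]
print(bad)
```

Looking at bad before discarding it shows how many rows are affected and whether they look like typos that could be corrected instead.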
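Run foverlaps, then:
- filter (keep only) the cases where the other person arrived no later than the person in question, and drop the matches of a user with themselves,
- add a variable counting the males encountered,
- add a variable counting the females encountered, all grouped by ID and Arrival.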
simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
.(malesEncountered = sum(i.Gender == "M"),
femalesEncountered = sum(i.Gender == "F")),
by = .(ID, Arrival)]
Join the result of the previous command with our original table on ID and Arrival:
result <- simultaneous[dt, on = .(ID, Arrival)]
Convert the NAs in malesEncountered and femalesEncountered to zero:
result[is.na(malesEncountered), malesEncountered := 0L][
  is.na(femalesEncountered), femalesEncountered := 0L]
Put the columns in a nicer order:
setcolorder(result, c(1, 2, 5, 6, 3, 4))[]
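To see the steps above end to end, here is a small self-contained run on five rows copied from the question (ID 1's first visit plus the ID 6 and ID 10 visits around it). It is a sketch, not a benchmark; the UTC time zone is my assumption.

```r
library(data.table)

fmt <- "%d/%m/%Y %H:%M"
dt <- data.table(
  ID = c(1, 6, 6, 10, 10),
  Arrival = as.POSIXct(c("10/04/2015 23:14", "10/04/2015 21:31", "10/04/2015 23:48",
                         "10/04/2015 22:39", "13/04/2015 02:57"), format = fmt, tz = "UTC"),
  Departure = as.POSIXct(c("11/04/2015 00:21", "10/04/2015 23:17", "11/04/2015 00:05",
                           "11/04/2015 00:42", "13/04/2015 03:07"), format = fmt, tz = "UTC"),
  Gender = c("F", "F", "F", "M", "M")
)
setkey(dt, Arrival, Departure)
dt2 <- copy(dt)

# Columns without a prefix come from dt2; columns prefixed with i. come from dt
simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
                                   .(malesEncountered = sum(i.Gender == "M"),
                                     femalesEncountered = sum(i.Gender == "F")),
                                   by = .(ID, Arrival)]

result <- simultaneous[dt, on = .(ID, Arrival)]
result[is.na(malesEncountered), malesEncountered := 0L][
  is.na(femalesEncountered), femalesEncountered := 0L]
print(result)
```

Because of the i.Arrival <= Arrival filter, each row counts only the people who were already there when that person arrived. For ID 1's first visit that gives 1 male and 1 female, as in the question; ID 6's second visit, which starts at 23:48, is not counted for ID 1 but does count ID 1 itself as an encounter.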