Finding all flights that have at least three years of data in R

I am working with a freely available flights dataset in R.

library(readr)
flights <- read_csv("http://ucl.ac.uk/~uctqiax/data/flights.csv")

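Now, suppose I want to find all flights that have been flying for at least three consecutive years, i.e. the date column contains dates from three consecutive years. Basically, I am only interested in the year part of the date.

I was thinking of the following approach: create a unique list of all plane names, then for each plane get all of its dates and check whether they cover three consecutive years.

I started like this: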

NOyears = 3
planes <- unique(flights$plane) 

# at least 3 consecutive years 
for (plane in planes){
  plane = "N576AA"
  allyears <- which(flights$plane == plane)
}

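But I am stuck here. The whole approach is starting to feel too complicated. Is there an easier/faster way? Bear in mind that I am working with a very large dataset...

Note: I would like to be able to specify the number of years later on, which is why I included NOyears = 3 in the first place.

Edit: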

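I just noticed a related question on SO. The use of diff and cumsum there is very interesting, and both are new to me. Maybe a similar approach could be used here with data.table?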
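For readers who have not seen the idiom, here is a minimal base R sketch of how diff and cumsum can label runs of consecutive years. The helper name max_consecutive and the toy vector are made up for illustration; they are not part of the dataset or of any package.

```r
# Hypothetical helper: length of the longest run of consecutive years
max_consecutive <- function(years) {
  years <- sort(unique(years))
  if (length(years) == 0L) return(0L)
  # A new run starts wherever the gap to the previous year is not exactly 1,
  # so cumsum() over those break points yields one id per run
  run_id <- cumsum(c(TRUE, diff(years) != 1))
  # tabulate() counts how many years fall into each run
  max(tabulate(run_id))
}

years <- c(2005, 2006, 2007, 2009, 2010)  # toy data
NOyears <- 3
max_consecutive(years) >= NOyears
```

Applied per plane (for example with tapply or a grouped operation), this would flag the planes whose longest run reaches NOyears.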

dplyr can do the job:

library(dplyr)
library(lubridate)

flights %>%
  mutate(year = year(date)) %>%
  group_by(plane) %>%
  summarise(range = max(year) - min(year)) %>%
  filter(range >= 2)

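Although I don't see any planes that meet the criteria!

Edit: Per mnist's comment, consecutive years are a bit trickier, but here is a working example with consecutive months (the data you provided only covers one year). Just swap in years!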

nMonths <- 6
flights %>%
  mutate(month = month(date)) %>% # Calculate month
  count(plane, month) %>% # Summarize to one row for each plane/month combo
  arrange(plane, month) %>% # Arrange by plane, month so we can look at consecutive months
  group_by(plane) %>% # Within each plane...
  mutate(consecutiveMonths = sequence(rle(cumsum(c(TRUE, diff(month) != 1)))$lengths)) %>% # ...number each month within its run of consecutive months (any gap other than 1 starts a new run)
  group_by(plane) %>% # Then, for each plane...
  summarise(maxConsecutiveMonths = max(consecutiveMonths)) %>% # ...return the maximum number of consecutive months
  filter(maxConsecutiveMonths >= nMonths) # And keep only those planes that meet the criteria!
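As a rough sketch of what "swapping in years" could look like, here is the same idea on a small made-up data frame. The toy data and the names toy, nYears, run and runLength are assumptions for illustration only; with the real data, the year column would come from year(date).

```r
library(dplyr)

# Toy data, made up for illustration: one row per flight, year already extracted
toy <- data.frame(
  plane = c("A", "A", "A", "A", "B", "B", "B", "C"),
  year  = c(2011, 2012, 2012, 2013, 2011, 2013, 2014, 2012),
  stringsAsFactors = FALSE
)

nYears <- 3

result <- toy %>%
  distinct(plane, year) %>%                         # one row per plane/year
  arrange(plane, year) %>%                          # consecutive years become adjacent
  group_by(plane) %>%
  mutate(run = cumsum(c(TRUE, diff(year) != 1))) %>% # new run id whenever the gap is not 1
  group_by(plane, run) %>%
  summarise(runLength = n()) %>%                    # length of each run; still grouped by plane
  summarise(maxConsecutiveYears = max(runLength)) %>%
  filter(maxConsecutiveYears >= nYears)

result
```

Only plane "A" has years 2011 to 2013 without a gap, so it should be the only row left after the filter.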

Here is a data.table approach (using months, since the file only contains one year; it keeps the flights that operated in 12 consecutive months):

library(data.table)
flights <- fread("http://ucl.ac.uk/~uctqiax/data/flights.csv")
flights[, month:=month(date)]
setkey(flights, plane, date)
flights[, max_run:=lapply(.SD, function(x) max(rle(cumsum(c(0, diff(unique(x))) > 1))$lengths)), 
.SDcols="month", by="plane"][max_run > 11][]
#>                        date hour minute  dep  arr dep_delay arr_delay carrier
#>      1: 2011-01-01 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>      2: 2011-01-01 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>      3: 2011-01-01 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>      4: 2011-01-02 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>      5: 2011-01-02 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>     ---                                                                      
#> 151636: 2011-11-21 12:00:00   10     56 1056 1359        25        37      FL
#> 151637: 2011-12-09 12:00:00   18     36 1836 2126        -5        -4      FL
#> 151638: 2011-12-13 12:00:00   17     27 1727 2013        -3        -7      FL
#> 151639: 2011-12-14 12:00:00    6     28  628  914        -2        -8      FL
#> 151640: 2011-12-14 12:00:00   11     57 1157 1438        -3       -14      FL
#>         flight dest  plane cancelled time dist month max_run
#>      1:   2174  PNS                1   NA  489     1      12
#>      2:   2277  BRO                1   NA  308     1      12
#>      3:   2811  MOB                1   NA  427     1      12
#>      4:   2204  OKC                1   NA  395     1      12
#>      5:   2570  BTR                1   NA  253     1      12
#>     ---                                                     
#> 151636:    298  ATL N983AT         0   98  696    11      12
#> 151637:    296  ATL N983AT         0   89  696    12      12
#> 151638:    292  ATL N983AT         0   87  696    12      12
#> 151639:    290  ATL N983AT         0   86  696    12      12
#> 151640:    286  ATL N983AT         0   87  696    12      12

Created on 2020-05-14 by the reprex package (v0.3.0)

Here is another option using data.table:

#summarize into a smaller dataset; assuming that we are not counting days to check for consecutive years
yearly <- flights[, .(year=unique(year(date))), .(carrier, flight)]

#add a dummy flight to demonstrate consecutive years
yearly <- rbindlist(list(yearly, data.table(carrier="ZZ", flight="111", year=2011:2014)))

setkey(yearly, carrier, flight, year)    
yearly[, c("rl", "rw") := {
    iscons <- cumsum(c(0L, diff(year)!=1L))
    .(iscons, rowid(carrier, flight, iscons))
}]

yearly[rl %in% yearly[rw>=3L]$rl]

Output:

   carrier flight year   rl rw
1:      ZZ    111 2011 5117  1
2:      ZZ    111 2012 5117  2
3:      ZZ    111 2013 5117  3
4:      ZZ    111 2014 5117  4
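If the number of years needs to be configurable, as the question asks, one possible variant computes the run ids within each carrier/flight group, so a run can never span two different flights. This is only a sketch on made-up toy data; the column names run and run_len and the variable res are assumptions, not part of the answer above.

```r
library(data.table)

# Toy data, made up for illustration: one row per carrier/flight/year
yearly <- data.table(
  carrier = c("AA", "AA", "AA", "BB", "BB", "ZZ", "ZZ", "ZZ", "ZZ"),
  flight  = c(1L, 1L, 1L, 2L, 2L, 111L, 111L, 111L, 111L),
  year    = c(2011L, 2012L, 2014L, 2012L, 2013L, 2011L, 2012L, 2013L, 2014L)
)

NOyears <- 3L

setkey(yearly, carrier, flight, year)
# Run id restarts within each carrier/flight; a gap other than 1 year starts a new run
yearly[, run := cumsum(c(TRUE, diff(year) != 1L)), by = .(carrier, flight)]
# Length of the run each row belongs to
yearly[, run_len := .N, by = .(carrier, flight, run)]
# Keep only rows that are part of a run of at least NOyears consecutive years
res <- yearly[run_len >= NOyears]
res
```

On this toy table only the ZZ 111 rows for 2011 to 2014 should remain.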