如何 cross-reference R 中的 tibbles?
How to cross-reference tibbles in R?
library(nycflights13)
library(tidyverse)
我的任务是
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).
我生成了一个关于每两个机场之间的平均飞行时间的小标题:
# A tibble: 224 x 3
# Groups: origin [?]
origin dest mean_time
<chr> <chr> <dbl>
1 EWR ALB 31.78708
2 EWR ANC 413.12500
3 EWR ATL 111.99385
4 EWR AUS 211.24765
5 EWR AVL 89.79681
6 EWR BDL 25.46602
7 EWR BNA 114.50915
8 EWR BOS 40.31275
9 EWR BQN 196.17288
10 EWR BTV 46.25734
# ... with 214 more rows
现在我需要扫描 flights
并提取所有行,其 air_time 在 say (mean_time/2, mean_time*2) 之外。我该怎么做?
您首先要将该平均飞行时间数据框加入到您的原始 flights
数据中,然后应用过滤器。像这样的东西应该有用。
library(nycflights13)
library(tidyverse)
data("flights")
#get mean time
mean_time <- flights %>%
group_by(origin, dest) %>%
summarise(mean_time = mean(air_time, na.rm = TRUE))
#join mean time to original data
df <- left_join(flights, mean_time)
flag_flights <- df %>%
filter(air_time <= (mean_time / 2) | air_time >= (mean_time * 2))
> flag_flights
# A tibble: 29 x 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 16 635 608 27 916 725 111 UA 541 N837UA EWR BOS 81 200 6 8
2 2013 1 21 1851 1900 -9 2034 2012 22 US 2140 N956UW LGA BOS 76 184 19 0
3 2013 1 28 1917 1825 52 2118 1935 103 US 1860 N755US LGA PHL 75 96 18 25
4 2013 10 7 1059 1105 -6 1306 1215 51 MQ 3230 N524MQ JFK DCA 96 213 11 5
5 2013 10 10 950 959 -9 1155 1115 40 EV 5711 N829AS JFK IAD 97 228 9 59
6 2013 2 17 841 840 1 1044 1003 41 9E 3422 N913XJ JFK BOS 86 187 8 40
7 2013 3 8 1136 1001 95 1409 1116 173 UA 1240 N17730 EWR BOS 82 200 10 1
8 2013 3 8 1246 1245 1 1552 1350 122 AA 1850 N3FEAA JFK BOS 80 187 12 45
9 2013 3 12 1607 1500 67 1803 1608 115 US 2132 N946UW LGA BOS 77 184 15 0
10 2013 3 12 1612 1557 15 1808 1720 48 UA 1116 N37252 EWR BOS 81 200 15 57
# ... with 19 more rows, and 2 more variables: time_hour <dttm>, mean_time <dbl>
假设您已经存储了包含平均飞行时间的小标题,将其加入航班 table:
flights_suspicious <- left_join(flights, average_flight_times, by=c("origin","dest")) %>%
filter(air_time < mean_time / 2 | air_time > mean_time * 2)
library(nycflights13)
library(tidyverse)
我的任务是
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).
我生成了一个关于每两个机场之间的平均飞行时间的小标题:
# A tibble: 224 x 3
# Groups: origin [?]
origin dest mean_time
<chr> <chr> <dbl>
1 EWR ALB 31.78708
2 EWR ANC 413.12500
3 EWR ATL 111.99385
4 EWR AUS 211.24765
5 EWR AVL 89.79681
6 EWR BDL 25.46602
7 EWR BNA 114.50915
8 EWR BOS 40.31275
9 EWR BQN 196.17288
10 EWR BTV 46.25734
# ... with 214 more rows
现在我需要扫描 flights
并提取所有行,其 air_time 在 say (mean_time/2, mean_time*2) 之外。我该怎么做?
您首先要将该平均飞行时间数据框加入到您的原始 flights
数据中,然后应用过滤器。像这样的东西应该有用。
library(nycflights13)
library(tidyverse)
data("flights")
#get mean time
mean_time <- flights %>%
group_by(origin, dest) %>%
summarise(mean_time = mean(air_time, na.rm = TRUE))
#join mean time to original data
df <- left_join(flights, mean_time)
flag_flights <- df %>%
filter(air_time <= (mean_time / 2) | air_time >= (mean_time * 2))
> flag_flights
# A tibble: 29 x 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 16 635 608 27 916 725 111 UA 541 N837UA EWR BOS 81 200 6 8
2 2013 1 21 1851 1900 -9 2034 2012 22 US 2140 N956UW LGA BOS 76 184 19 0
3 2013 1 28 1917 1825 52 2118 1935 103 US 1860 N755US LGA PHL 75 96 18 25
4 2013 10 7 1059 1105 -6 1306 1215 51 MQ 3230 N524MQ JFK DCA 96 213 11 5
5 2013 10 10 950 959 -9 1155 1115 40 EV 5711 N829AS JFK IAD 97 228 9 59
6 2013 2 17 841 840 1 1044 1003 41 9E 3422 N913XJ JFK BOS 86 187 8 40
7 2013 3 8 1136 1001 95 1409 1116 173 UA 1240 N17730 EWR BOS 82 200 10 1
8 2013 3 8 1246 1245 1 1552 1350 122 AA 1850 N3FEAA JFK BOS 80 187 12 45
9 2013 3 12 1607 1500 67 1803 1608 115 US 2132 N946UW LGA BOS 77 184 15 0
10 2013 3 12 1612 1557 15 1808 1720 48 UA 1116 N37252 EWR BOS 81 200 15 57
# ... with 19 more rows, and 2 more variables: time_hour <dttm>, mean_time <dbl>
假设您已经存储了包含平均飞行时间的小标题,将其加入航班 table:
flights_suspicious <- left_join(flights, average_flight_times, by=c("origin","dest")) %>%
filter(air_time < mean_time / 2 | air_time > mean_time * 2)