Reduce/filter data based on class and date occurrence
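I have a dataset of different ships in different zones. The output I get records each ship's name, its type (e.g. fishing/cargo), the time it entered the zone, the time it left, and how long it stayed in the zone. DOS is just distance offshore, i.e. the zone I am looking at.
My problem is that fishing vessels often run transects and go in and out of the zone several times in one day, so they are listed several times in my report output.
I would like to consolidate the fishing vessel data so that if a vessel of the same name (type Fishing only) is recorded more than once per day, all but one record is removed. For simplicity, maybe just look at the "First seen in zone date", since I assume it gets more complicated when a particular stay spans multiple days (I can come back to that later).
Dummy data: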
df <- structure(list(Name = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 8L,
8L, 9L), .Label = c("A", "B", "C", "D", "E", "F", "G", "H", "I"
), class = "factor"), Type = structure(c(2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L,
2L, 1L, 1L, 2L), .Label = c("Cargo", "Fishing"), class = "factor"),
`First seen inside` = structure(c(1556385360, 1556393640,
1556002200, 1556260260, 1556518860, 1556136660, 1556278500,
1556285820, 1556391480, 1556509620, 1556319480, 1556214120,
1556235600, 1556325540, 1556326920, 1556329500, 1556330220,
1556330580, 1556330880, 1556330940, 1556332980, 1556339880,
1556340900, 1556344140, 1556344500, 1556345220, 1556346420,
1556348220, 1556348520, 1556350860, 1556351460, 1556356620,
1556360220, 1556365920, 1556366520, 1556367180, 1556076420,
1556166900, 1556154840, 1556454900, 1556291220), class = c("POSIXct",
"POSIXt"), tzone = ""), `Last seen inside` = structure(c(34L,
35L, 1L, 8L, 38L, 3L, 7L, 9L, 36L, 38L, 27L, 4L, 5L, 10L,
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 26L, 28L, 29L, 30L, 31L, 32L, 33L, 2L, 6L,
37L, 38L, 38L), .Label = c("4/23/2019 14:27", "4/24/2019 21:23",
"4/25/2019 00:00", "4/25/2019 10:47", "4/25/2019 16:59",
"4/25/2019 23:49", "4/26/2019 05:17", "4/26/2019 13:39",
"4/26/2019 15:12", "4/26/2019 17:54", "4/26/2019 18:05",
"4/26/2019 18:51", "4/26/2019 19:00", "4/26/2019 19:06",
"4/26/2019 19:08", "4/26/2019 19:13", "4/26/2019 21:24",
"4/26/2019 21:38", "4/26/2019 22:02", "4/26/2019 22:51",
"4/26/2019 22:55", "4/26/2019 23:22", "4/26/2019 23:51",
"4/27/2019 00:00", "4/27/2019 00:36", "4/27/2019 00:42",
"4/27/2019 01:17", "4/27/2019 02:06", "4/27/2019 03:11",
"4/27/2019 04:30", "4/27/2019 05:00", "4/27/2019 05:03",
"4/27/2019 05:13", "4/27/2019 10:29", "4/27/2019 12:42",
"4/27/2019 17:21", "4/28/2019 03:47", "4/29/2019 09:56"), class =
"factor"),
`Time in zone` = structure(c(5L, 31L, 6L, 7L, 2L, 3L, 23L,
30L, 26L, 4L, 32L, 27L, 9L, 8L, 22L, 28L, 22L, 22L, 1L, 24L,
15L, 1L, 29L, 18L, 1L, 8L, 17L, 22L, 19L, 16L, 14L, 25L,
13L, 31L, 16L, 1L, 12L, 10L, 21L, 11L, 20L), .Label = c("",
"10h 35m", "10h 49m", "13h 9m", "13m", "14h 37m", "14h 8m",
"15m", "19m", "1d 2h 14m", "1d 4h 21m", "1d 56m", "1h 13m",
"1h 15m", "1h 41m", "1m", "24m", "2m", "34m", "3d 1h 49m",
"3d 9h 33m", "3m", "42m", "4m", "54m", "5h 23m", "5m", "6m",
"7m", "8h 35m", "8m", "9h 19m"), class = "factor"), DOS =
structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "0-12", class =
"factor")), row.names = c(NA,
-41L), class = "data.frame")
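So, for example, in my dummy dataset:
Since ship "A" is a "Fishing" vessel in DOS 0-12 and it appears twice on April 27, I would like to reduce its entries to one record. If possible, it would be nice if the summed "time in zone" and the latest "last seen inside" carried over into the mutated data, but don't worry too much if that is too complicated.
So Ship A would only show: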
Name  Type     First seen inside  Last seen inside  Time in zone  DOS
A     Fishing  4/27/2019 12:16    4/27/2019 12:42   21m           0-12
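But I would be happy just to reduce it to one of the rows, without correcting the last-seen time or the time in zone, if that is too much.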
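For the summed "time in zone", the strings would first have to become numbers. The following is only a minimal base R sketch, assuming the only units that occur are d, h and m, as in the dummy data; `to_minutes` is a hypothetical helper name, not something from the dataset or from any package.

```r
# Convert strings such as "1d 2h 14m" to total minutes.
# Empty strings and NA (no duration recorded) become NA.
to_minutes <- function(x) {
  x <- as.character(x)  # also handles factor columns
  # Number in front of a given unit letter, or 0 when that unit is absent.
  part <- function(unit) {
    hit <- grepl(paste0("[0-9]+", unit), x)
    out <- numeric(length(x))
    out[hit] <- as.numeric(
      sub(paste0("^.*?([0-9]+)", unit, ".*$"), "\\1", x[hit], perl = TRUE)
    )
    out
  }
  total <- part("d") * 24 * 60 + part("h") * 60 + part("m")
  total[is.na(x) | !nzchar(x)] <- NA_real_
  total
}

# Ship A's two visits on April 27 from the dummy data: 13m + 8m.
sum(to_minutes(c("13m", "8m")))  # 21
```

Summing with `na.rm = TRUE` would skip the rows whose duration is blank; whether a blank should count as zero is a decision for the real data.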
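For ship C, because it is a cargo ship, I don't want to treat it like the fishing vessels; I want to keep all of its records, even if there are several per day.
For ship E, because it shows up on three different days, I would like it to have three data entries...
I hope this makes some sense? I'm not sure whether this is possible with a dplyr filter or mutate based on same-day duplicates. Any advice on how to manage this "problem" would be great... otherwise I may need to do some manual work on the dataset :(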
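library(dplyr)
# Fishing vessels only: one row per Name / DOS / first-seen date,
# keeping the latest "Last seen inside" date within each group.
df %>% group_by(Name,DOS,as.Date(`First seen inside`)) %>%
  filter(Type=="Fishing") %>%
  summarize(last=max(as.Date(`Last seen inside`, format="%m/%d/%Y")))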
Something like this? Result:
# A tibble: 10 x 4
# Groups: Name, DOS [6]
Name DOS `as.Date(\`First seen inside\`)` last
<fct> <fct> <date> <date>
1 A 0-12 2019-04-27 2019-04-27
2 B 0-12 2019-04-23 2019-04-23
3 B 0-12 2019-04-26 2019-04-26
4 B 0-12 2019-04-29 2019-04-29
5 D 0-12 2019-04-26 2019-04-27
6 E 0-12 2019-04-25 2019-04-25
7 E 0-12 2019-04-27 2019-04-27
8 G 0-12 2019-04-24 2019-04-24
9 G 0-12 2019-04-25 2019-04-25
10 I 0-12 2019-04-26 2019-04-29
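The result above contains fishing vessels only, so the cargo rows the question wants to keep are gone. One possible way to keep them is to summarise the fishing rows and then bind the untouched rows back on. This is a sketch on a small made-up data frame that reuses the question's column names; `ships`, `prepared`, `day`, `last_seen` and `visits` are illustrative names, and `tz = "UTC"` is an assumption (the timezone decides which calendar day a visit falls on, so the real data may need a different one).

```r
library(dplyr)

# Made-up stand-in for df: same column names, illustrative rows.
ships <- data.frame(
  Name = c("A", "A", "C", "C", "E", "E"),
  Type = c("Fishing", "Fishing", "Cargo", "Cargo", "Fishing", "Fishing"),
  `First seen inside` = as.POSIXct(
    c("2019-04-27 10:16", "2019-04-27 12:34",
      "2019-04-26 04:35", "2019-04-26 06:37",
      "2019-04-25 10:42", "2019-04-26 17:39"),
    tz = "UTC"),
  `Last seen inside` = c("4/27/2019 10:29", "4/27/2019 12:42",
                         "4/26/2019 05:17", "4/26/2019 15:12",
                         "4/25/2019 10:47", "4/26/2019 17:54"),
  DOS = "0-12",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Add the calendar day of entry and parse the exit time to a real date-time.
prepared <- ships %>%
  mutate(
    day = as.Date(`First seen inside`, tz = "UTC"),
    last_seen = as.POSIXct(as.character(`Last seen inside`),
                           format = "%m/%d/%Y %H:%M", tz = "UTC")
  )

# Fishing vessels: one row per Name / DOS / day, first entry to last exit.
fishing <- prepared %>%
  filter(Type == "Fishing") %>%
  group_by(Name, Type, DOS, day) %>%
  summarise(
    `First seen inside` = min(`First seen inside`),
    last_seen = max(last_seen),
    visits = n(),
    .groups = "drop"
  )

# Every other type (here: Cargo) keeps all of its rows.
other <- prepared %>%
  filter(Type != "Fishing") %>%
  select(Name, Type, DOS, day, `First seen inside`, last_seen) %>%
  mutate(visits = 1L)

result <- bind_rows(fishing, other) %>%
  arrange(Name, `First seen inside`)
result
```

In this toy data, ship A collapses to a single row with `visits` equal to 2, ship C keeps both cargo rows, and ship E keeps one row per day. The `.groups` argument needs dplyr 1.0.0 or later.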