在 R 中拆分多个日期和时间变量并计算时间平均值
Splitting multiple date and time variables & computing time average in R
我有以下数据集,其中,我有此人的 ID、他们居住的地区和街道,以及他们将数据上传到服务器的最后 date/time。变量“last_down_”包含一个人上传数据的最后一个 date/time,并以显示我下载数据的日期的方式命名。例如,“last_upload_2020-06-12”表示我在 6 月 12 日从服务器下载了数据。
对于下面的数据集,我想将日期和时间溢出到每个变量中(一次全部),新创建的分隔变量的名称为“last_date_ (我的下载日期)" & "last_time_(我的下载日期)"
district block id last_upload_2020-06-12 last_upload_2020-06-13 last_upload_2020-06-14 last_upload_2020-06-15
A X 11 2020-02-06 11:53:19.0 2020-02-06 11:53:19.0 2020-02-06 11:53:19.0 2020-02-06 11:53:19.0
A X 12 2020-06-11 12:40:26.0 2020-06-11 12:40:26.0 2020-06-14 11:40:26.0 2020-06-15 18:50:26.0
A X 2020-06-14 11:08:12.0 2020-06-14 11:08:12.0
A X 14 2020-06-12 11:31:07.0 2020-06-13 11:31:07.0 2020-06-14 17:37:07.0 2020-06-14 17:37:07.0
A Y 15 2020-06-10 12:45:48.0 2020-06-10 12:45:48.0 2020-06-10 12:45:48.0 2020-06-10 12:45:48.0
A Y 16 2020-04-04 02:26:57.0 2020-04-04 02:26:57.0 2020-04-04 02:26:57.0 2020-04-04 02:26:57.0
A Y 17 2020-03-31 08:10:03.0 2020-03-31 08:10:03.0 2020-03-31 08:10:03.0 2020-03-31 08:10:03.0
A Y 18 2020-05-30 12:08:15.0 2020-05-30 12:08:15.0 2020-05-30 12:08:15.0 2020-05-30 12:08:15.0
A Z 19 2020-04-09 15:21:52.0 2020-04-09 15:21:52.0 2020-04-09 15:21:52.0 2020-04-09 15:21:52.0
A Z 20 2020-05-30 17:42:33.0 2020-05-30 17:42:33.0 2020-05-30 17:42:33.0 2020-05-30 17:42:33.0
A Z 21 2020-04-12 14:23:29.0 2020-04-12 14:23:29.0 2020-04-12 14:23:29.0 2020-04-12 14:23:29.0
A Z 22 2020-05-13 23:18:19.0 2020-05-13 23:18:19.0 2020-05-13 23:18:19.0 2020-05-13 23:18:19.0
A X 23 2020-04-30 09:53:31.0 2020-04-30 09:53:31.0 2020-04-30 09:53:31.0 2020-04-30 09:53:31.0
A X 24 2020-06-10 10:28:59.0 2020-06-10 10:28:59.0 2020-06-10 10:28:59.0 2020-06-15 11:31:33.0
A Y 25
A Y 26 2020-05-30 12:14:09.0 2020-05-30 12:14:09.0 2020-05-30 12:14:09.0 2020-05-30 12:14:09.0
B E 31
B C 32 2020-06-12 16:43:23.0 2020-06-12 16:43:23.0 2020-06-12 16:43:23.0 2020-06-12 16:43:23.0
B C 33 2019-10-24 22:30:35.0 2019-10-24 22:30:35.0 2019-10-24 22:30:35.0 2019-10-24 22:30:35.0
B C 34 2020-06-09 15:38:18.0 2020-06-09 15:38:18.0 2020-06-09 15:38:18.0 2020-06-15 14:35:41.0
B C 35 2020-06-11 14:39:51.0 2020-06-11 14:39:51.0 2020-06-11 14:39:51.0 2020-06-11 14:39:51.0
B D 36 2020-06-12 11:53:15.0 2020-06-12 11:53:15.0 2020-06-12 11:53:15.0 2020-06-15 13:02:39.0
B D 37 2020-04-21 15:43:43.0 2020-04-21 15:43:43.0 2020-04-21 15:43:43.0 2020-04-21 15:43:43.0
B D 38 2020-05-13 04:07:17.0 2020-05-13 04:07:17.0 2020-05-13 04:07:17.0 2020-05-13 04:07:17.0
B E 39 2020-04-30 13:51:20.0 2020-04-30 13:51:20.0 2020-04-30 13:51:20.0 2020-04-30 13:51:20.0
B E 40 2020-05-12 16:51:01.0 2020-05-12 16:51:01.0 2020-05-12 16:51:01.0 2020-05-12 16:51:01.0
B E 41 2020-04-16 12:14:24.0 2020-04-16 12:14:24.0 2020-04-16 12:14:24.0 2020-04-16 12:14:24.0
B C 42 2018-06-07 15:12:18.0 2018-06-07 15:12:18.0 2018-06-07 15:12:18.0 2018-06-07 15:12:18.0
B D 43 2019-09-28 10:08:51.0 2019-09-28 10:08:51.0 2019-09-28 10:08:51.0 2019-09-28 10:08:51.0
N.B:我的 date/time 变量是数字。
一旦我得到了数据,我还想做以下事情:
在单独的列中获取“last_upload_2020-06-12”下所有观测值的年份和月份。
同样,我数据集中的最后一个日期是“last_upload_2020-06-15”。我可以让 R 自动选择最后一个日期,比如 Sys.Date()-1 吗?我将始终拥有比当前日期少一个日期的数据。
计算每个ID的平均上传时间,即一般人在什么时间左右上传数据到服务器?平均值应基于独特的时间值。
如果有人可以帮助解决这个问题,将非常有帮助!
谢谢,
拉奇塔
您可以在您的原始数据集中尝试以下代码。这可能会帮助您回答问题的介绍、第一部分、第三部分和最后的第二部分。
library(lubridate)
library(tidyverse)
district <- c("A","A","B","B","C","C")
block <- c("X","Y","Z","X","Y","Z")
id <- c(11,11,12,12,13,13)
upload_dt <- ymd_hms(c("2020-06-13 11:31:07",
"2020-04-12 14:23:29",
"2020-04-30 13:51:20",
"2020-06-12 11:53:15",
"2019-09-28 02:08:51",
"2020-04-12 16:23:29"))
df <- data.frame(district, block, id, upload_dt)
df <- df %>%
separate(upload_dt, into = c("date","time"),
sep = " ", remove = F)
df$upload_date <- paste("last_upload_date_is", df$date)
df$upload_time <- paste("last_upload_time_is", df$time)
df <- df %>%
mutate(date_added = ymd(df$date),
year_upload = year(date),
month_upload = month(date))
df
引言和问题第一部分的输出如下:-
district block id upload_dt date time upload_date
1 A X 11 2020-06-13 11:31:07 2020-06-13 11:31:07 last_upload_date_is 2020-06-13
2 A Y 11 2020-04-12 14:23:29 2020-04-12 14:23:29 last_upload_date_is 2020-04-12
3 B Z 12 2020-04-30 13:51:20 2020-04-30 13:51:20 last_upload_date_is 2020-04-30
4 B X 12 2020-06-12 11:53:15 2020-06-12 11:53:15 last_upload_date_is 2020-06-12
5 C Y 13 2019-09-28 02:08:51 2019-09-28 02:08:51 last_upload_date_is 2019-09-28
6 C Z 13 2020-04-12 16:23:29 2020-04-12 16:23:29 last_upload_date_is 2020-04-12
upload_time date_added year_upload month_upload
1 last_upload_time_is 11:31:07 2020-06-13 2020 6
2 last_upload_time_is 14:23:29 2020-04-12 2020 4
3 last_upload_time_is 13:51:20 2020-04-30 2020 4
4 last_upload_time_is 11:53:15 2020-06-12 2020 6
5 last_upload_time_is 02:08:51 2019-09-28 2019 9
6 last_upload_time_is 16:23:29 2020-04-12 2020 4
第三部分问题的代码和输出如下:-
df %>% group_by(id) %>%
summarise(avg_time_per_id = format(mean(strptime(time, "%H:%M:%S")), "%H:%M:%S")) %>%
ungroup()
# A tibble: 3 x 2
id avg_time_per_id
<dbl> <chr>
1 11 12:57:18
2 12 12:52:17
3 13 09:16:10
问题第二部分的代码和输出如下:-
(注意我已经创建了一个新的数据框。)您可以将此解决方案应用于现有数据集。
df <- data.frame(
id = c(1:5),
district = c("X","Y","X","Y","X"),
block = c("A","A","B","B","C"),
upload_date_a = paste0(rep("2020-06-13"), " ", rep("11:31:07")),
upload_date_b = paste0(rep("2010-08-15"), " ", rep("02:45:27")),
upload_date_c = paste0(rep("2000-10-30"), " ", rep("16:45:51")),
stringsAsFactors = F
)
col_ind <- grep(x = names(df), pattern = "upload_date", value = T, ignore.case = T)
cols_list <- lapply(seq_along(col_ind), function(x){
q1 <- do.call(rbind, strsplit(df[[col_ind[[x]]]], split = " "))
q2 <- data.frame(q1, stringsAsFactors = F)
i <- ncol(q2)
colnames(q2) <- paste0(col_ind[[x]], c(1:i))
return(q2)
}
)
df_new <- cbind(df[1:3], do.call(cbind, cols_list))
df_new
id district block upload_date_a1 upload_date_a2 upload_date_b1
1 1 X A 2020-06-13 11:31:07 2010-08-15
2 2 Y A 2020-06-13 11:31:07 2010-08-15
3 3 X B 2020-06-13 11:31:07 2010-08-15
4 4 Y B 2020-06-13 11:31:07 2010-08-15
5 5 X C 2020-06-13 11:31:07 2010-08-15
upload_date_b2 upload_date_c1 upload_date_c2
1 02:45:27 2000-10-30 16:45:51
2 02:45:27 2000-10-30 16:45:51
3 02:45:27 2000-10-30 16:45:51
4 02:45:27 2000-10-30 16:45:51
5 02:45:27 2000-10-30 16:45:51
Df 看起来很复杂,我认为复制它可能会更好。
然后我使用一个函数来获取你想要的每一列,并根据需要将它分成 last_date 和 last_time 。在函数内部,临时 DF 是 cbind
到循环外构建的 DF。该 DF 由未在循环中处理的列组成。
此循环的结果是所需的 DF。 [colnames 有点长]
第二个任务的重点是转last_time
小时,然后分组总结。
希望这就是你想要的。
我觉得以此为基础你可以对付no2
有一些与 NA 相关的警告。
在下面的 reprex 中有更多解释。
library(tidyverse)
df <- read.table(text = '
district block id last_upload_2020_06_12 last_upload_2020_06_13 last_upload_2020_06_14 last_upload_2020_06_15
"A" "X" 11 "2020-02-06 11:53:19.0" "2020-02-06 11:53:19.0" "2020-02-06 11:53:19.0" "2020-02-06 11:53:19.0"
"A" "X" 12 "2020-06-11 12:40:26.0" "2020-06-11 12:40:26.0" "2020-06-14 11:40:26.0" "2020-06-15 18:50:26.0"
"A" "X" NA "NA" "NA" "2020-06-14 11:0812.0" "2020-06-14 11:0812.0"
"A" "X" 14 "2020-06-12 11:31:07.0" "2020-06-13 11:31:07.0" "2020-06-14 17:37:07.0" "2020-06-14 17:37:07.0"
"A" "Y" 15 "2020-06-10 12:45:48.0" "2020-06-10 12:45:48.0" "2020-06-10 12:45:48.0" "2020-06-10 12:45:48.0"
"A" "Y" 16 "2020-04-04 02:26:57.0" "2020-04-04 02:26:57.0" "2020-04-04 02:26:57.0" "2020-04-04 02:26:57.0"
"A" "Y" 17 "2020-03-31 08:10:03.0" "2020-03-31 08:10:03.0" "2020-03-31 08:10:03.0" "2020-03-31 08:10:03.0"
"A" "Y" 18 "2020-05-30 12:08:15.0" "2020-05-30 12:08:15.0" "2020-05-30 12:08:15.0" "2020-05-30 12:08:15.0"
"A" "Z" 19 "2020-04-09 15:21:52.0" "2020-04-09 15:21:52.0" "2020-04-09 15:21:52.0" "2020-04-09 15:21:52.0"
"A" "Z" 20 "2020-05-30 17:42:33.0" "2020-05-30 17:42:33.0" "2020-05-30 17:42:33.0" "2020-05-30 17:42:33.0"
"A" "Z" 21 "2020-04-12 14:23:29.0" "2020-04-12 14:23:29.0" "2020-04-12 14:23:29.0" "2020-04-12 14:23:29.0"
"A" "Z" 22 "2020-05-13 23:18:19.0" "2020-05-13 23:18:19.0" "2020-05-13 23:18:19.0" "2020-05-13 23:18:19.0"
"A" "X" 23 "2020-04-30 09:53:31.0" "2020-04-30 09:53:31.0" "2020-04-30 09:53:31.0" "2020-04-30 09:53:31.0"
"A" "X" 24 "2020-06-10 10:28:59.0" "2020-06-10 10:28:59.0" "2020-06-10 10:28:59.0" "2020-06-15 11:31:33.0"
"A" "Y" 25 " " "" "" ""
"A" "Y" 26 "2020-05-3012:14:09.0" "2020-05-30 12:14:09.0" "2020-05-30 12:14:09.0" "2020-05-30 12:14:09.0"
"B" "E" 31 "" "" "" "" ""
"B" "C" 32 "2020-06-1 16:43:23.0" "2020-06-12 16:43:23.0" "2020-06-12 16:43:23.0" "2020-06-12 16:43:23.0"
"B" "C" 33 "2019-10-24 22:30:35.0" "2019-10-24 22:30:35.0" "2019-10-24 22:30:35.0" "2019-10-24 22:30:35.0"
"B" "C" 34 "2020-06-09 15:38:18.0" "2020-06-09 15:38:18.0" "2020-06-09 15:38:18.0" "2020-06-15 14:35:41.0"
"B" "C" 35 "2020-06-11 14:39:51.0" "2020-06-11 14:39:51.0" "2020-06-11 14:39:51.0" "2020-06-11 14:39:51.0"
"B" "D" 36 "2020-06-12 11:53:15.0" "2020-06-12 11:53:15.0" "2020-06-12 11:53:15.0" "2020-06-15 13:02:39.0"
"B" "D" 37 "2020-04-21 15:43:43.0" "2020-04-21 15:43:43.0" "2020-04-21 15:43:43.0" "2020-04-21 15:43:43.0"
"B" "D" 38 "2020-05-13 04:07:17.0" "2020-05-13 04:07:17.0" "2020-05-13 04:07:17.0" "2020-05-13 04:07:17.0"
"B" "E" 39 "2020-04-30 13:51:20.0" "2020-04-30 13:51:20.0" "2020-04-30 13:51:20.0" "2020-04-30 13:51:20.0"
"B" "E" 40 "2020-05-12 16:51:01.0" "2020-05-12 16:51:01.0" "2020-05-12 16:51:01.0" "2020-05-12 16:51:01.0"
"B" "E" 41 "2020-04-16 12:14:24.0" "2020-04-16 12:14:24.0" "2020-04-16 12:14:24.0" "2020-04-16 12:14:24.0"
"B" "C" 42 "2018-06-07 15:12:18.0" "2018-06-07 15:12:18.0" "2018-06-07 15:12:18.0" "2018-06-07 15:12:18.0"
"B" "D" 43 "2019-09-28 10:08:51.0" "2019-09-28 10:08:51.0" "2019-09-28 10:08:51.0" "2019-09-28 10:08:51.0"
', header =T)
# TASK: create for each column which contains 'last_upload' new columns
# with date and time
# get the colnames of the cols to be split or separated
ccl <- colnames(df %>% select(last_upload_2020_06_12:last_upload_2020_06_15))
# create new DF with first 3 columns, to which other columns are bound in
# the following function
dff <- df %>% select(district:id)
# function to separate each col in ccl to _date and _time
for (cl in ccl) {
tmp <- separate(df,
col = cl, sep = " ",
into = c(paste0(cl, "_date"), paste0(cl, "_time"))
) %>%
select(contains("_date") | contains("_time"))
dff <- cbind(dff, tmp)
}
dff %>% head()
#> district block id last_upload_2020_06_12_date last_upload_2020_06_12_time
#> 1 A X 11 2020-02-06 11:53:19.0
#> 2 A X 12 2020-06-11 12:40:26.0
#> 3 A X NA <NA> <NA>
#> 4 A X 14 2020-06-12 11:31:07.0
#> 5 A Y 15 2020-06-10 12:45:48.0
#> 6 A Y 16 2020-04-04 02:26:57.0
#> last_upload_2020_06_13_date last_upload_2020_06_13_time
#> 1 2020-02-06 11:53:19.0
#> 2 2020-06-11 12:40:26.0
#> 3 <NA> <NA>
#> 4 2020-06-13 11:31:07.0
#> 5 2020-06-10 12:45:48.0
#> 6 2020-04-04 02:26:57.0
#> last_upload_2020_06_14_date last_upload_2020_06_14_time
#> 1 2020-02-06 11:53:19.0
#> 2 2020-06-14 11:40:26.0
#> 3 2020-06-14 11:0812.0
#> 4 2020-06-14 17:37:07.0
#> 5 2020-06-10 12:45:48.0
#> 6 2020-04-04 02:26:57.0
#> last_upload_2020_06_15_date last_upload_2020_06_15_time
#> 1 2020-02-06 11:53:19.0
#> 2 2020-06-15 18:50:26.0
#> 3 2020-06-14 11:0812.0
#> 4 2020-06-14 17:37:07.0
#> 5 2020-06-10 12:45:48.0
#> 6 2020-04-04 02:26:57.0
# TASK: Calculate the average time of a day each id does a download
# new DF from original brought into long format
# split the date/time into last_date and last_time
ddf <- df %>%
pivot_longer(cols = last_upload_2020_06_12:last_upload_2020_06_15) %>%
separate(col = value, sep = ' ', into = c('last_date', 'last_time')) %>%
mutate(last_date = lubridate::ymd(last_date), last_time= lubridate::hms(last_time))
# calculating the mean hour of the day at which each id does a
# download, by calculating last_time to hours (of the day) and
# after grouping build mean hour
ddf %>%
mutate(hours = as.numeric(lubridate::hms(last_time), unit = 'hour')) %>%
group_by(id) %>% summarise(meanHourOfTheDay = mean(hours, na.rm = T))
#> # A tibble: 29 x 2
#> id meanHourOfTheDay
#> <int> <dbl>
#> 1 11 11.9
#> 2 12 14.0
#> 3 14 14.6
#> 4 15 12.8
#> 5 16 2.45
#> 6 17 8.17
#> 7 18 12.1
#> 8 19 15.4
#> 9 20 17.7
#> 10 21 14.4
#> # … with 19 more rows
我有以下数据集,其中,我有此人的 ID、他们居住的地区和街道,以及他们将数据上传到服务器的最后 date/time。变量“last_down_”包含一个人上传数据的最后一个 date/time,并以显示我下载数据的日期的方式命名。例如,“last_upload_2020-06-12”表示我在 6 月 12 日从服务器下载了数据。
对于下面的数据集,我想将日期和时间溢出到每个变量中(一次全部),新创建的分隔变量的名称为“last_date_ (我的下载日期)" & "last_time_(我的下载日期)"
district block id last_upload_2020-06-12 last_upload_2020-06-13 last_upload_2020-06-14 last_upload_2020-06-15
A X 11 2020-02-06 11:53:19.0 2020-02-06 11:53:19.0 2020-02-06 11:53:19.0 2020-02-06 11:53:19.0
A X 12 2020-06-11 12:40:26.0 2020-06-11 12:40:26.0 2020-06-14 11:40:26.0 2020-06-15 18:50:26.0
A X 2020-06-14 11:08:12.0 2020-06-14 11:08:12.0
A X 14 2020-06-12 11:31:07.0 2020-06-13 11:31:07.0 2020-06-14 17:37:07.0 2020-06-14 17:37:07.0
A Y 15 2020-06-10 12:45:48.0 2020-06-10 12:45:48.0 2020-06-10 12:45:48.0 2020-06-10 12:45:48.0
A Y 16 2020-04-04 02:26:57.0 2020-04-04 02:26:57.0 2020-04-04 02:26:57.0 2020-04-04 02:26:57.0
A Y 17 2020-03-31 08:10:03.0 2020-03-31 08:10:03.0 2020-03-31 08:10:03.0 2020-03-31 08:10:03.0
A Y 18 2020-05-30 12:08:15.0 2020-05-30 12:08:15.0 2020-05-30 12:08:15.0 2020-05-30 12:08:15.0
A Z 19 2020-04-09 15:21:52.0 2020-04-09 15:21:52.0 2020-04-09 15:21:52.0 2020-04-09 15:21:52.0
A Z 20 2020-05-30 17:42:33.0 2020-05-30 17:42:33.0 2020-05-30 17:42:33.0 2020-05-30 17:42:33.0
A Z 21 2020-04-12 14:23:29.0 2020-04-12 14:23:29.0 2020-04-12 14:23:29.0 2020-04-12 14:23:29.0
A Z 22 2020-05-13 23:18:19.0 2020-05-13 23:18:19.0 2020-05-13 23:18:19.0 2020-05-13 23:18:19.0
A X 23 2020-04-30 09:53:31.0 2020-04-30 09:53:31.0 2020-04-30 09:53:31.0 2020-04-30 09:53:31.0
A X 24 2020-06-10 10:28:59.0 2020-06-10 10:28:59.0 2020-06-10 10:28:59.0 2020-06-15 11:31:33.0
A Y 25
A Y 26 2020-05-30 12:14:09.0 2020-05-30 12:14:09.0 2020-05-30 12:14:09.0 2020-05-30 12:14:09.0
B E 31
B C 32 2020-06-12 16:43:23.0 2020-06-12 16:43:23.0 2020-06-12 16:43:23.0 2020-06-12 16:43:23.0
B C 33 2019-10-24 22:30:35.0 2019-10-24 22:30:35.0 2019-10-24 22:30:35.0 2019-10-24 22:30:35.0
B C 34 2020-06-09 15:38:18.0 2020-06-09 15:38:18.0 2020-06-09 15:38:18.0 2020-06-15 14:35:41.0
B C 35 2020-06-11 14:39:51.0 2020-06-11 14:39:51.0 2020-06-11 14:39:51.0 2020-06-11 14:39:51.0
B D 36 2020-06-12 11:53:15.0 2020-06-12 11:53:15.0 2020-06-12 11:53:15.0 2020-06-15 13:02:39.0
B D 37 2020-04-21 15:43:43.0 2020-04-21 15:43:43.0 2020-04-21 15:43:43.0 2020-04-21 15:43:43.0
B D 38 2020-05-13 04:07:17.0 2020-05-13 04:07:17.0 2020-05-13 04:07:17.0 2020-05-13 04:07:17.0
B E 39 2020-04-30 13:51:20.0 2020-04-30 13:51:20.0 2020-04-30 13:51:20.0 2020-04-30 13:51:20.0
B E 40 2020-05-12 16:51:01.0 2020-05-12 16:51:01.0 2020-05-12 16:51:01.0 2020-05-12 16:51:01.0
B E 41 2020-04-16 12:14:24.0 2020-04-16 12:14:24.0 2020-04-16 12:14:24.0 2020-04-16 12:14:24.0
B C 42 2018-06-07 15:12:18.0 2018-06-07 15:12:18.0 2018-06-07 15:12:18.0 2018-06-07 15:12:18.0
B D 43 2019-09-28 10:08:51.0 2019-09-28 10:08:51.0 2019-09-28 10:08:51.0 2019-09-28 10:08:51.0
N.B:我的 date/time 变量是数字。
一旦我得到了数据,我还想做以下事情:
在单独的列中获取“last_upload_2020-06-12”下所有观测值的年份和月份。
同样,我数据集中的最后一个日期是“last_upload_2020-06-15”。我可以让 R 自动选择最后一个日期,比如 Sys.Date()-1 吗?我将始终拥有比当前日期少一个日期的数据。
计算每个ID的平均上传时间,即一般人在什么时间左右上传数据到服务器?平均值应基于独特的时间值。
如果有人可以帮助解决这个问题,将非常有帮助!
谢谢, 拉奇塔
您可以在您的原始数据集中尝试以下代码。这可能会帮助您回答问题的介绍、第一部分、第三部分和最后的第二部分。
library(lubridate)
library(tidyverse)
district <- c("A","A","B","B","C","C")
block <- c("X","Y","Z","X","Y","Z")
id <- c(11,11,12,12,13,13)
upload_dt <- ymd_hms(c("2020-06-13 11:31:07",
"2020-04-12 14:23:29",
"2020-04-30 13:51:20",
"2020-06-12 11:53:15",
"2019-09-28 02:08:51",
"2020-04-12 16:23:29"))
df <- data.frame(district, block, id, upload_dt)
df <- df %>%
separate(upload_dt, into = c("date","time"),
sep = " ", remove = F)
df$upload_date <- paste("last_upload_date_is", df$date)
df$upload_time <- paste("last_upload_time_is", df$time)
df <- df %>%
mutate(date_added = ymd(df$date),
year_upload = year(date),
month_upload = month(date))
df
引言和问题第一部分的输出如下:-
district block id upload_dt date time upload_date
1 A X 11 2020-06-13 11:31:07 2020-06-13 11:31:07 last_upload_date_is 2020-06-13
2 A Y 11 2020-04-12 14:23:29 2020-04-12 14:23:29 last_upload_date_is 2020-04-12
3 B Z 12 2020-04-30 13:51:20 2020-04-30 13:51:20 last_upload_date_is 2020-04-30
4 B X 12 2020-06-12 11:53:15 2020-06-12 11:53:15 last_upload_date_is 2020-06-12
5 C Y 13 2019-09-28 02:08:51 2019-09-28 02:08:51 last_upload_date_is 2019-09-28
6 C Z 13 2020-04-12 16:23:29 2020-04-12 16:23:29 last_upload_date_is 2020-04-12
upload_time date_added year_upload month_upload
1 last_upload_time_is 11:31:07 2020-06-13 2020 6
2 last_upload_time_is 14:23:29 2020-04-12 2020 4
3 last_upload_time_is 13:51:20 2020-04-30 2020 4
4 last_upload_time_is 11:53:15 2020-06-12 2020 6
5 last_upload_time_is 02:08:51 2019-09-28 2019 9
6 last_upload_time_is 16:23:29 2020-04-12 2020 4
第三部分问题的代码和输出如下:-
df %>% group_by(id) %>%
summarise(avg_time_per_id = format(mean(strptime(time, "%H:%M:%S")), "%H:%M:%S")) %>%
ungroup()
# A tibble: 3 x 2
id avg_time_per_id
<dbl> <chr>
1 11 12:57:18
2 12 12:52:17
3 13 09:16:10
问题第二部分的代码和输出如下:-
(注意我已经创建了一个新的数据框。)您可以将此解决方案应用于现有数据集。
df <- data.frame(
id = c(1:5),
district = c("X","Y","X","Y","X"),
block = c("A","A","B","B","C"),
upload_date_a = paste0(rep("2020-06-13"), " ", rep("11:31:07")),
upload_date_b = paste0(rep("2010-08-15"), " ", rep("02:45:27")),
upload_date_c = paste0(rep("2000-10-30"), " ", rep("16:45:51")),
stringsAsFactors = F
)
col_ind <- grep(x = names(df), pattern = "upload_date", value = T, ignore.case = T)
cols_list <- lapply(seq_along(col_ind), function(x){
q1 <- do.call(rbind, strsplit(df[[col_ind[[x]]]], split = " "))
q2 <- data.frame(q1, stringsAsFactors = F)
i <- ncol(q2)
colnames(q2) <- paste0(col_ind[[x]], c(1:i))
return(q2)
}
)
df_new <- cbind(df[1:3], do.call(cbind, cols_list))
df_new
id district block upload_date_a1 upload_date_a2 upload_date_b1
1 1 X A 2020-06-13 11:31:07 2010-08-15
2 2 Y A 2020-06-13 11:31:07 2010-08-15
3 3 X B 2020-06-13 11:31:07 2010-08-15
4 4 Y B 2020-06-13 11:31:07 2010-08-15
5 5 X C 2020-06-13 11:31:07 2010-08-15
upload_date_b2 upload_date_c1 upload_date_c2
1 02:45:27 2000-10-30 16:45:51
2 02:45:27 2000-10-30 16:45:51
3 02:45:27 2000-10-30 16:45:51
4 02:45:27 2000-10-30 16:45:51
5 02:45:27 2000-10-30 16:45:51
Df 看起来很复杂,我认为复制它可能会更好。
然后我使用一个函数来获取你想要的每一列,并根据需要将它分成 last_date 和 last_time 。在函数内部,临时 DF 是 cbind
到循环外构建的 DF。该 DF 由未在循环中处理的列组成。
此循环的结果是所需的 DF。 [colnames 有点长]
第二个任务的重点是转last_time
小时,然后分组总结。
希望这就是你想要的。
我觉得以此为基础你可以对付no2
有一些与 NA 相关的警告。
在下面的 reprex 中有更多解释。
library(tidyverse)
df <- read.table(text = '
district block id last_upload_2020_06_12 last_upload_2020_06_13 last_upload_2020_06_14 last_upload_2020_06_15
"A" "X" 11 "2020-02-06 11:53:19.0" "2020-02-06 11:53:19.0" "2020-02-06 11:53:19.0" "2020-02-06 11:53:19.0"
"A" "X" 12 "2020-06-11 12:40:26.0" "2020-06-11 12:40:26.0" "2020-06-14 11:40:26.0" "2020-06-15 18:50:26.0"
"A" "X" NA "NA" "NA" "2020-06-14 11:0812.0" "2020-06-14 11:0812.0"
"A" "X" 14 "2020-06-12 11:31:07.0" "2020-06-13 11:31:07.0" "2020-06-14 17:37:07.0" "2020-06-14 17:37:07.0"
"A" "Y" 15 "2020-06-10 12:45:48.0" "2020-06-10 12:45:48.0" "2020-06-10 12:45:48.0" "2020-06-10 12:45:48.0"
"A" "Y" 16 "2020-04-04 02:26:57.0" "2020-04-04 02:26:57.0" "2020-04-04 02:26:57.0" "2020-04-04 02:26:57.0"
"A" "Y" 17 "2020-03-31 08:10:03.0" "2020-03-31 08:10:03.0" "2020-03-31 08:10:03.0" "2020-03-31 08:10:03.0"
"A" "Y" 18 "2020-05-30 12:08:15.0" "2020-05-30 12:08:15.0" "2020-05-30 12:08:15.0" "2020-05-30 12:08:15.0"
"A" "Z" 19 "2020-04-09 15:21:52.0" "2020-04-09 15:21:52.0" "2020-04-09 15:21:52.0" "2020-04-09 15:21:52.0"
"A" "Z" 20 "2020-05-30 17:42:33.0" "2020-05-30 17:42:33.0" "2020-05-30 17:42:33.0" "2020-05-30 17:42:33.0"
"A" "Z" 21 "2020-04-12 14:23:29.0" "2020-04-12 14:23:29.0" "2020-04-12 14:23:29.0" "2020-04-12 14:23:29.0"
"A" "Z" 22 "2020-05-13 23:18:19.0" "2020-05-13 23:18:19.0" "2020-05-13 23:18:19.0" "2020-05-13 23:18:19.0"
"A" "X" 23 "2020-04-30 09:53:31.0" "2020-04-30 09:53:31.0" "2020-04-30 09:53:31.0" "2020-04-30 09:53:31.0"
"A" "X" 24 "2020-06-10 10:28:59.0" "2020-06-10 10:28:59.0" "2020-06-10 10:28:59.0" "2020-06-15 11:31:33.0"
"A" "Y" 25 " " "" "" ""
"A" "Y" 26 "2020-05-3012:14:09.0" "2020-05-30 12:14:09.0" "2020-05-30 12:14:09.0" "2020-05-30 12:14:09.0"
"B" "E" 31 "" "" "" "" ""
"B" "C" 32 "2020-06-1 16:43:23.0" "2020-06-12 16:43:23.0" "2020-06-12 16:43:23.0" "2020-06-12 16:43:23.0"
"B" "C" 33 "2019-10-24 22:30:35.0" "2019-10-24 22:30:35.0" "2019-10-24 22:30:35.0" "2019-10-24 22:30:35.0"
"B" "C" 34 "2020-06-09 15:38:18.0" "2020-06-09 15:38:18.0" "2020-06-09 15:38:18.0" "2020-06-15 14:35:41.0"
"B" "C" 35 "2020-06-11 14:39:51.0" "2020-06-11 14:39:51.0" "2020-06-11 14:39:51.0" "2020-06-11 14:39:51.0"
"B" "D" 36 "2020-06-12 11:53:15.0" "2020-06-12 11:53:15.0" "2020-06-12 11:53:15.0" "2020-06-15 13:02:39.0"
"B" "D" 37 "2020-04-21 15:43:43.0" "2020-04-21 15:43:43.0" "2020-04-21 15:43:43.0" "2020-04-21 15:43:43.0"
"B" "D" 38 "2020-05-13 04:07:17.0" "2020-05-13 04:07:17.0" "2020-05-13 04:07:17.0" "2020-05-13 04:07:17.0"
"B" "E" 39 "2020-04-30 13:51:20.0" "2020-04-30 13:51:20.0" "2020-04-30 13:51:20.0" "2020-04-30 13:51:20.0"
"B" "E" 40 "2020-05-12 16:51:01.0" "2020-05-12 16:51:01.0" "2020-05-12 16:51:01.0" "2020-05-12 16:51:01.0"
"B" "E" 41 "2020-04-16 12:14:24.0" "2020-04-16 12:14:24.0" "2020-04-16 12:14:24.0" "2020-04-16 12:14:24.0"
"B" "C" 42 "2018-06-07 15:12:18.0" "2018-06-07 15:12:18.0" "2018-06-07 15:12:18.0" "2018-06-07 15:12:18.0"
"B" "D" 43 "2019-09-28 10:08:51.0" "2019-09-28 10:08:51.0" "2019-09-28 10:08:51.0" "2019-09-28 10:08:51.0"
', header =T)
# TASK: create for each column which contains 'last_upload' new columns
# with date and time
# get the colnames of the cols to be split or separated
ccl <- colnames(df %>% select(last_upload_2020_06_12:last_upload_2020_06_15))
# create new DF with first 3 columns, to which other columns are bound in
# the following function
dff <- df %>% select(district:id)
# function to separate each col in ccl to _date and _time
for (cl in ccl) {
tmp <- separate(df,
col = cl, sep = " ",
into = c(paste0(cl, "_date"), paste0(cl, "_time"))
) %>%
select(contains("_date") | contains("_time"))
dff <- cbind(dff, tmp)
}
dff %>% head()
#> district block id last_upload_2020_06_12_date last_upload_2020_06_12_time
#> 1 A X 11 2020-02-06 11:53:19.0
#> 2 A X 12 2020-06-11 12:40:26.0
#> 3 A X NA <NA> <NA>
#> 4 A X 14 2020-06-12 11:31:07.0
#> 5 A Y 15 2020-06-10 12:45:48.0
#> 6 A Y 16 2020-04-04 02:26:57.0
#> last_upload_2020_06_13_date last_upload_2020_06_13_time
#> 1 2020-02-06 11:53:19.0
#> 2 2020-06-11 12:40:26.0
#> 3 <NA> <NA>
#> 4 2020-06-13 11:31:07.0
#> 5 2020-06-10 12:45:48.0
#> 6 2020-04-04 02:26:57.0
#> last_upload_2020_06_14_date last_upload_2020_06_14_time
#> 1 2020-02-06 11:53:19.0
#> 2 2020-06-14 11:40:26.0
#> 3 2020-06-14 11:0812.0
#> 4 2020-06-14 17:37:07.0
#> 5 2020-06-10 12:45:48.0
#> 6 2020-04-04 02:26:57.0
#> last_upload_2020_06_15_date last_upload_2020_06_15_time
#> 1 2020-02-06 11:53:19.0
#> 2 2020-06-15 18:50:26.0
#> 3 2020-06-14 11:0812.0
#> 4 2020-06-14 17:37:07.0
#> 5 2020-06-10 12:45:48.0
#> 6 2020-04-04 02:26:57.0
# TASK: Calculate the average time of a day each id does a download
# new DF from original brought into long format
# split the date/time into last_date and last_time
ddf <- df %>%
pivot_longer(cols = last_upload_2020_06_12:last_upload_2020_06_15) %>%
separate(col = value, sep = ' ', into = c('last_date', 'last_time')) %>%
mutate(last_date = lubridate::ymd(last_date), last_time= lubridate::hms(last_time))
# calculating the mean hour of the day at which each id does a
# download, by calculating last_time to hours (of the day) and
# after grouping build mean hour
ddf %>%
mutate(hours = as.numeric(lubridate::hms(last_time), unit = 'hour')) %>%
group_by(id) %>% summarise(meanHourOfTheDay = mean(hours, na.rm = T))
#> # A tibble: 29 x 2
#> id meanHourOfTheDay
#> <int> <dbl>
#> 1 11 11.9
#> 2 12 14.0
#> 3 14 14.6
#> 4 15 12.8
#> 5 16 2.45
#> 6 17 8.17
#> 7 18 12.1
#> 8 19 15.4
#> 9 20 17.7
#> 10 21 14.4
#> # … with 19 more rows