How to read in timestamps of format %Y-%m-%d %H:%M:%OS3 (and do math with it)?

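I have a .txt file (without any explicit column separator) in which every line contains a timestamp of format %Y-%m-%d %H:%M:%OS3 (e.g. "2019-09-26 07:29:22,778") and an event string. I want to read in the data and build a table that shows the full timestamp in one column, the event in a second column, and, in a third column, the time span in OS3 format (e.g. "1.230" or "1,230" seconds) between the event in row 1 and the event in row 2, then between the event in row 1 and the event in row 3, and so on.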

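I tried opening the file in Excel with "[" as the delimiter, saving it as .tsv, and reading that in, which is an unsatisfying workaround. Beyond that, using difftime in a dplyr pipeline does not give results that include milliseconds, even though the global option is set to 3-digit seconds ("options(digits.secs=3)").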

What the .txt looks like:

2019-09-26 17:54:24,406 [218] INFO  - [1] - Event X
2019-09-26 17:54:24,431 [207] INFO  - [1] - Event Y
2019-09-26 17:54:24,438 [218] INFO  - [1] - Event Z
...
.
.

What I would like to get:

timestamp                   event            timediff in sec
2019-09-26 17:54:24,406     Event X
2019-09-26 17:54:24,431     Event Y          0.025
2019-09-26 17:54:24,438     Event Z          0.032
...
.
.

Here you go:

df <- data.table::fread(text = "2019-09-26 17:54:24,406 [218] INFO  - [1] - Event X
2019-09-26 17:54:24,431 [207] INFO  - [1] - Event Y
2019-09-26 17:54:24,438 [218] INFO  - [1] - Event Z", sep = "[", header = FALSE) # [ seems most convenient to use as sep
colnames(df) <- c("timestamp", "garbage", "event")

df
#>                  timestamp      garbage        event
#> 1: 2019-09-26 17:54:24,406 218] INFO  - 1] - Event X
#> 2: 2019-09-26 17:54:24,431 207] INFO  - 1] - Event Y
#> 3: 2019-09-26 17:54:24,438 218] INFO  - 1] - Event Z

library(dplyr)
library(stringr)


df_clean <- df %>% 
  select(-garbage) %>% 
  mutate(timestamp = str_replace(timestamp, ",", ".")) %>%  # comma must be replaced so milliseconds are recognised
  mutate(timestamp = as.POSIXct(timestamp, format = "%Y-%m-%d %H:%M:%OS"),
         event = str_extract(event, "Event.*"),
         start_time = min(timestamp), # adding the first timestamp as new column, could be removed later
         "timediff in sec" = as.numeric(timestamp - start_time, units = "secs")) # this converts difftime to numeric


df_clean
#>             timestamp   event          start_time timediff in sec
#> 1 2019-09-26 17:54:24 Event X 2019-09-26 17:54:24      0.00000000
#> 2 2019-09-26 17:54:24 Event Y 2019-09-26 17:54:24      0.02500010
#> 3 2019-09-26 17:54:24 Event Z 2019-09-26 17:54:24      0.03200006

Created on 2019-10-10 by the reprex package (v0.3.0)
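The printed timestamps above show no milliseconds, yet the differences are non-zero, which suggests the fractional seconds are stored and only hidden when printing. A minimal base-R sketch of that point follows. The `tz = "UTC"` argument and the variable names are assumptions made for illustration, and rounding to three digits is just one way to hide floating-point noise such as 0.02500010.

```r
# Fractional seconds are kept in the POSIXct value even when not printed.
op <- options(digits.secs = 3)  # ask R to print up to 3 fractional digits

# tz = "UTC" is an assumption; the log does not state its time zone
t0 <- as.POSIXct("2019-09-26 17:54:24.406",
                 format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")
t1 <- as.POSIXct("2019-09-26 17:54:24.431",
                 format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Difference in seconds, rounded to millisecond precision
diff_sec <- round(as.numeric(difftime(t1, t0, units = "secs")), 3)
print(diff_sec)

options(op)  # restore the previous options
```

Applied to the data frame above, the same rounding could be wrapped around the "timediff in sec" column.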

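You can read the txt file with read.delim, using [ as the separator. The problem with the 3 digits arises because your timestamps use a comma instead of a dot as the decimal separator. This can be fixed with str_replace (or gsub):
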
library(dplyr)
library(stringr)

my_df <- read.delim(text = "
2019-09-26 17:54:24,406 [218] INFO  - [1] - Event X
2019-09-26 17:54:24,431 [207] INFO  - [1] - Event Y
2019-09-26 17:54:24,438 [218] INFO  - [1] - Event Z", 
sep = "[", header = FALSE, col.names = c("timestamp", "info", "event"))

my_df
#                 timestamp          info         event
# 1 2019-09-26 17:54:24,406  218] INFO  -  1] - Event X
# 2 2019-09-26 17:54:24,431  207] INFO  -  1] - Event Y
# 3 2019-09-26 17:54:24,438  218] INFO  -  1] - Event Z

my_df %>% 
  # drop the info column
  select(-info) %>% 
  mutate(# remove anything not related to the Event
         event = str_remove(event, ".*Event"), 
         # replace , with .
         timestamp = str_replace_all(timestamp, ",", "."),
         # transform to a proper timestamp
         timestamp = as.POSIXct(timestamp, format="%Y-%m-%d %H:%M:%OS"), 
         # calculate difftime (as proposed in your previous question [1])
         difftime = difftime(timestamp, timestamp[1], units = "secs"))
#                 timestamp event        difftime
# 1 2019-09-26 17:54:24.405     X 0.00000000 secs
# 2 2019-09-26 17:54:24.430     Y 0.02500010 secs
# 3 2019-09-26 17:54:24.437     Z 0.03200006 secs
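The answer mentions gsub as an alternative to str_replace without showing it. A minimal sketch of that base-R variant, under the assumption that the timestamps have already been split off into a character vector (the names `raw`, `fixed_ts` and `elapsed` are illustrative, and `tz = "UTC"` is an assumption because the log does not state its time zone):

```r
# Base-R variant of the comma fix, using gsub instead of stringr.
raw <- c("2019-09-26 17:54:24,406",
         "2019-09-26 17:54:24,431",
         "2019-09-26 17:54:24,438")

# fixed = TRUE treats "," as a literal string, not a regular expression
fixed_ts <- gsub(",", ".", raw, fixed = TRUE)

# Parse with fractional seconds (%OS reads the decimals on input)
ts <- as.POSIXct(fixed_ts, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Seconds elapsed since the first event, rounded to milliseconds
elapsed <- round(as.numeric(difftime(ts, ts[1], units = "secs")), 3)
print(elapsed)
```

The .405, .430 and .437 in the printed output above look like a display effect rather than a parsing error: R truncates fractional seconds when printing instead of rounding them, and the stored values sit just below .406, .431 and .438.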

[1]