Parsing a txt file in R

I need to parse a txt file like this:

2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s

I am interested in parsing the dates and the numbers into an R data frame. I am trying a parser like this (for the dates only):

df <- read_table("myFile.txt", col_names = FALSE, col_types = cols(X1 = col_datetime(format = "%Y %b %d %H:%M:%S")))

But it does not work:

Warning: 31502 parsing failures.
row col                    expected actual                                                file
  1  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
  2  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
  3  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
  4  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
  5  X1 date like %Y %b %d %H:%M:%S   2021 'uclStats/91.211.159.43-dash_d1_gwv_vos-u5.log-avg'
... ... ........................... ...... ...................................................
See problems(...) for more details.

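The problem is evidently that read_table splits on whitespace, so the first column holds only the year (2021), yet it is being parsed with the format for the whole date-time.

What is the right way to parse this txt file into a data frame?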

Regards, S.

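This should get you started: read the text file and replace the runs of spaces (or whatever string separates the columns) with commas (or semicolons, etc.). Then pass the result to read.csv through its text= argument. Finally, use any of the many date parsers to convert the string to a date data type.

1. Creating sample data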

txt <- "2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s"

2. Read the data using read_lines. In your case, txt would be the path to the text file.

library(readr)  # provides read_lines()
read.csv(text=gsub("     ",  ", ", read_lines(txt)), sep=",", header = FALSE)

Returns:

                    V1       V2 V3        V4
1 2021 Sep 27 15:54:50  avg_dur  =   0.321 s
2 2021 Sep 27 15:54:52  avg_dur  =   0.036 s
3 2021 Sep 27 15:54:54  avg_dur  =   0.350 s
4 2021 Sep 27 15:54:56  avg_dur  =   0.317 s
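
The last step described above (turning the strings into proper types) could look roughly like the following. This is only a sketch in base R, not necessarily what the answer had in mind: it splits on any run of two or more spaces instead of exactly five, uses strsplit in place of read_lines so that it runs without extra packages, and the column names time and avg_dur as well as tz = "UTC" are assumptions made for illustration. Note that %b expects month abbreviations in the current locale, so "Sep" needs an English (or C) locale.

```r
# Sample data, same layout as the file in the question
txt <- "2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s"

# Split the literal text into lines (for a real file, readLines("myFile.txt"))
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]

# Turn every run of two or more spaces into a comma, then read as CSV
df <- read.csv(text = gsub(" {2,}", ",", lines), header = FALSE,
               stringsAsFactors = FALSE)

# V1 -> POSIXct; tz = "UTC" is an assumption, use the log's real time zone
df$time <- as.POSIXct(df$V1, format = "%Y %b %d %H:%M:%S", tz = "UTC")

# V4 looks like "0.321 s": drop the trailing unit and convert to numeric
df$avg_dur <- as.numeric(sub("\\s*s\\s*$", "", df$V4))

# Keep only the two columns of interest
result <- df[, c("time", "avg_dur")]
print(result)
```

Splitting on " {2,}" keeps the single spaces inside the date intact while tolerating separators of different widths.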

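1) read.zoo Read it into a zoo object, z, and then convert that to a data frame (or just leave it as a zoo object). This makes use of the fact that junk at the end of the index column is ignored when converting to POSIXct.

For reproducibility we have used Lines, defined in the Note at the end, but text = Lines can be replaced with "myFile.txt".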

library(zoo)

z <- read.zoo(text = Lines, sep = "=", 
  format = "%Y %b %d %H:%M:%S", tz = "", comment.char = "s")
fortify.zoo(z)

giving this data frame with a POSIXct column and a numeric column:

                Index     z
1 2021-09-27 15:54:50 0.321
2 2021-09-27 15:54:52 0.036
3 2021-09-27 15:54:54 0.350
4 2021-09-27 15:54:56 0.317
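
The behavior this relies on can be checked in isolation. The sketch below feeds as.POSIXct one index field as it would look after splitting on "=", trailing text included; the variable names and tz = "UTC" are assumptions added here so the result does not depend on the local time zone.

```r
# One index field as it looks after splitting the line on "="
idx <- "2021 Sep 27 15:54:50     avg_dur     "

# Parsing stops once the format has been consumed; the remaining
# text ("     avg_dur     ") is ignored rather than causing an NA
parsed <- as.POSIXct(idx, format = "%Y %b %d %H:%M:%S", tz = "UTC")
print(parsed)
```

The other half of the trick is comment.char = "s", which cuts off the trailing unit. That works for this input only because no lowercase "s" occurs earlier on the line ("Sep" has a capital S and "avg_dur" contains none); a different label could break it.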

2) Base R Read it into a data frame, dd, and then convert the first column to POSIXct.

dd <- read.table(text = Lines, sep = "=", comment.char = "s")
dd$V1 <- as.POSIXct(dd$V1, format = "%Y %b %d %H:%M:%S")
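
For completeness, here is a self-contained variant of the base R approach that also names the columns and inspects the result. The names time and avg_dur, the explicit stringsAsFactors = FALSE, and tz = "UTC" are assumptions added for illustration; the original uses the default V1/V2 names and the local time zone.

```r
Lines <- "2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s"

# Split each line on "=" and treat the trailing "s" as a comment;
# whitespace around the numeric field is stripped automatically
dd <- read.table(text = Lines, sep = "=", comment.char = "s",
                 col.names = c("time", "avg_dur"),
                 stringsAsFactors = FALSE)

# The trailing "     avg_dur     " in the first field is ignored here
dd$time <- as.POSIXct(dd$time, format = "%Y %b %d %H:%M:%S", tz = "UTC")

str(dd)
```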

Note

Lines <- "2021 Sep 27 15:54:50     avg_dur     =      0.321 s
2021 Sep 27 15:54:52     avg_dur     =      0.036 s
2021 Sep 27 15:54:54     avg_dur     =      0.350 s
2021 Sep 27 15:54:56     avg_dur     =      0.317 s"