如何从 data.frame 中提取第一个和最后一个填充的列块?

how to extract the first and last filled block of columns from a data.frame?

我们有文件下载时间的日志数据。

对于每个单独的交易,都会有一个开始和结束时间戳。

excel

中的原始数据

每一行都是一个包含多个下载的事务。每个下载都有一个包含 3 列的块,其中包含开始日期、开始时间 (hh:mm.ss) 和开始毫秒数。每行的前 3 列是开始时间,一行中的最后 3 个单元格值是结束时间。

我想以这样的方式准备数据,即每一行中只有第一次和最后一次下载的三列(=行),如下所示。

我正在 excel 中使用 INDIRECT 和 ADDRESS 函数来完成工作。

这可以在 R 中完成吗?

我已经将数据加载到 R 中,如下所示。空单元格值存储为 NA。

nov <-read.csv(file = '././data/NovemberResults-uniq.csv',header = T,na.strings = FALSE,stringsAsFactors = FALSE)

R 中的数据

输入结果:

> dput(x = nov[1,])
structure(list(SD1 = structure(1L, .Label = "01-11-2015", class = "factor"), 
    ST1 = structure(1L, .Label = c(" 00:00:01", " 00:00:02", 
    " 00:00:11", " 00:00:13", " 00:00:27", " 00:00:28", " 01:13:16"
    ), class = "factor"), SMS1 = 323L, SD2 = structure(1L, .Label = " 2015-11-01 ", class = "factor"), 
    ST2 = structure(1L, .Label = c(" 00:00:01", " 00:00:02", 
    " 00:00:12", " 00:00:14", " 00:00:27", " 00:00:29", " 01:13:25"
    ), class = "factor"), SMS2 = 551L, SD3 = structure(1L, .Label = c("", 
    " 2015-11-01 "), class = "factor"), ST3 = structure(1L, .Label = c("", 
    " 00:00:27", " 01:13:33"), class = "factor"), SMS3 = NA_integer_, 
    SD4 = structure(1L, .Label = c("", " 2015-11-01 "), class = "factor"), 
    ST4 = structure(1L, .Label = c("", " 01:13:44"), class = "factor"), 
    SMS4 = NA_integer_), .Names = c("SD1", "ST1", "SMS1", "SD2", 
"ST2", "SMS2", "SD3", "ST3", "SMS3", "SD4", "ST4", "SMS4"), row.names = 1L, class = "data.frame")


SD1 ST1 SMS1    SD2 ST2 SMS2    SD3 ST3 SMS3    SD4 ST4 SMS4
01-11-2015   00:00:01   323  2015-11-01      00:00:01   551                     
01-11-2015   00:00:02   289  2015-11-01      00:00:02   618                     
01-11-2015   01:13:16   253  2015-11-01      01:13:25   511  2015-11-01      01:13:33   489  2015-11-01      01:13:44   870
01-11-2015   00:00:11   986  2015-11-01      00:00:12   602                     
01-11-2015   00:00:27   48   2015-11-01      00:00:27   391  2015-11-01      00:00:27   429         
01-11-2015   00:00:13   750  2015-11-01      00:00:14   255                     
01-11-2015   00:00:28   773  2015-11-01      00:00:29   114                     

忽略类型转换(例如将日期 + 时间字符列转换为一个 "datetime" POSIXct 列)可能的解决方案可能是:

# Read the data into a data.table using "white spaces" as separator.
# Important: Disable factors + interpret emtpy strings as "NA"
data <- read.table(header=TRUE, fill=TRUE, stringsAsFactors=FALSE, na.strings="", text=
"SD1    ST1 SMS1    SD2 ST2 SMS2    SD3 ST3 SMS3    SD4 ST4 SMS4
01-11-2015   00:00:01   323  2015-11-01      00:00:01   551
01-11-2015   00:00:02   289  2015-11-01      00:00:02   618
01-11-2015   01:13:16   253  2015-11-01      01:13:25   511  2015-11-01      01:13:33   489  2015-11-01      01:13:44   870
01-11-2015   00:00:11   986  2015-11-01      00:00:12   602
01-11-2015   00:00:27   48   2015-11-01      00:00:27   391  2015-11-01      00:00:27   429
01-11-2015   00:00:13   750  2015-11-01      00:00:14   255
01-11-2015   00:00:28   773  2015-11-01      00:00:29   114"
)

# Just for debugging purposes...
data
str(str)

# Append last available block of transaction event columns to the end
# ("ifelse" since the decision in which column to find the "last value" must be taken on a row-by-row base)
data$SD.End <- ifelse(!is.na(data$SD4),data$SD4,
                     ifelse(!is.na(data$SD3),data$SD3,
                            ifelse(!is.na(data$SD2),data$SD2, NA)))
data$ST.End <- ifelse(!is.na(data$ST4),data$ST4,
                     ifelse(!is.na(data$ST3),data$ST3,
                            ifelse(!is.na(data$ST2),data$ST2, NA)))
data$SMS.End <- ifelse(!is.na(data$SMS4),data$SMS4,
                   ifelse(!is.na(data$SMS3),data$SMS3,
                          ifelse(!is.na(data$SMS2),data$SMS2, NA)))

data

# Now prepare the output by "cutting" the wanted result into a new data.frame
result <- data.frame(c( data[,1:3], data[, 13:15]))

# show result
result

结果是:

> result
         SD1      ST1 SMS1     SD.End   ST.End SMS.End
1 01-11-2015 00:00:01  323 2015-11-01 00:00:01     551
2 01-11-2015 00:00:02  289 2015-11-01 00:00:02     618
3 01-11-2015 01:13:16  253 2015-11-01 01:13:44     870
4 01-11-2015 00:00:11  986 2015-11-01 00:00:12     602
5 01-11-2015 00:00:27   48 2015-11-01 00:00:27     429
6 01-11-2015 00:00:13  750 2015-11-01 00:00:14     255
7 01-11-2015 00:00:28  773 2015-11-01 00:00:29     114

核心问题是避免循环,但仍然是逐行工作,以决定从哪一列复制可用数据。必须这样做 "vectorized" 以避免性能下降,所以我使用了 ifelse.

可以使用 data.table:

实现针对 任意数量的交易事件列 的快速解决方案
# Preconditions for this solution:
# 1. Three columns per transaction event (download): Date, time, milliseconds
# 2. The download columns are at the beginning of the data.frame
# 3. There are no gaps within the downloads of row (in other words: NAs are always at the end)
# 4. Sufficient performance is only guaranteed if the number of columns is not to high (guess: several thousands)

# For efficiency I use a data.table instead of a data.frame
library(data.table)

# Read the data into a data.table using "white spaces" as separator.
# Important: Disable factors + interpret emtpy strings as "NA"
data <- read.table(header=TRUE, fill=TRUE, stringsAsFactors=FALSE, na.strings="", text=
                     "SD1    ST1 SMS1    SD2 ST2 SMS2    SD3 ST3 SMS3    SD4 ST4 SMS4
01-11-2015   00:00:01   323  2015-11-01      00:00:01   551
01-11-2015   00:00:02   289  2015-11-01      00:00:02   618
01-11-2015   01:13:16   253  2015-11-01      01:13:25   511  2015-11-01      01:13:33   489  2015-11-01      01:13:44   870
01-11-2015   00:00:11   986  2015-11-01      00:00:12   602
01-11-2015   00:00:27   48   2015-11-01      00:00:27   391  2015-11-01      00:00:27   429
01-11-2015   00:00:13   750  2015-11-01      00:00:14   255
01-11-2015   00:00:28   773  2015-11-01      00:00:29   114"
)

# Convert the data.frame into a data.table for efficient performance (and better processing syntax)
setDT(data)

# Specify the max. number of downloads per transaction in the data.frame.
# Since each download has three columns (data + time + milliseconds) derive this value from "ncol".
# If you have additional data columns you must set this value manually
max.num.of.downloads = ncol(data) / 3

# Calculate the number of empty cells ("columns") per row and add this value as new columns
data[, num.NA.cells := rowSums(is.na(data[, 1:(max.num.of.downloads*3), with=FALSE]))]

# Rough validation that NAs are consistent (three NAs per missing download)
stopifnot( nrow(data[(num.NA.cells %% 3) != 0,]) == 0 )

# Add a column containing the number of downloads
data[, downloads.count := max.num.of.downloads - (num.NA.cells / 3)]

# Now the big magic: For each group of data with the same transaction count: Add the "transaction end" columns.
# Note:
# a) .SD is a data table containing only the sub data (SD!) of the current group
# b) "with=FALSE" allows column indexes instead of names
# c) := is assignment by reference (creates new columns if they do not exist)
# d) The outer parens around the column names to be created ("SD.End") are required if you create or update more than one column at once with ":="
data[, (c("SD.End", "ST.End", "SMS.End")) := .SD[, seq((downloads.count - 1) * 3 + 1 , (downloads.count - 1) * 3 + 3), with=FALSE],
     by=downloads.count]

# data[, .N, by=downloads.count]  # just for debugging: Count the number of rows per downloads.count group

# "data" was now enriched with everything you need. Now you can just "cut out" what you need:
data[, .(SD1, ST1, SMS1, SD.End, ST.End, SMS.End)]

结果是一样的:

> data[, .(SD1, ST1, SMS1, SD.End, ST.End, SMS.End)]
          SD1      ST1 SMS1     SD.End   ST.End SMS.End
1: 01-11-2015 00:00:01  323 2015-11-01 00:00:01     551
2: 01-11-2015 00:00:02  289 2015-11-01 00:00:02     618
3: 01-11-2015 01:13:16  253 2015-11-01 01:13:44     870
4: 01-11-2015 00:00:11  986 2015-11-01 00:00:12     602
5: 01-11-2015 00:00:27   48 2015-11-01 00:00:27     429
6: 01-11-2015 00:00:13  750 2015-11-01 00:00:14     255
7: 01-11-2015 00:00:28  773 2015-11-01 00:00:29     114