How to extract the first and last filled block of columns from a data.frame?
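We have log data of file download times.
For each individual transaction there is a start and an end timestamp.
Raw data in Excel
Each row is a transaction containing multiple downloads. Each download has a block of 3 columns holding the start date, the start time (hh:mm:ss) and the start milliseconds. The first 3 columns of each row are the start time, and the last 3 filled cells in a row are the end time.
I want to prepare the data so that each row contains only the three-column blocks of the first and the last download, as shown below.
In Excel I am using the INDIRECT and ADDRESS functions to get this done.
Can this be done in R?
I have loaded the data into R as shown below. Empty cell values are stored as NA.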
nov <-read.csv(file = '././data/NovemberResults-uniq.csv',header = T,na.strings = FALSE,stringsAsFactors = FALSE)
Data in R
Output of dput:
> dput(x = nov[1,])
structure(list(SD1 = structure(1L, .Label = "01-11-2015", class = "factor"),
ST1 = structure(1L, .Label = c(" 00:00:01", " 00:00:02",
" 00:00:11", " 00:00:13", " 00:00:27", " 00:00:28", " 01:13:16"
), class = "factor"), SMS1 = 323L, SD2 = structure(1L, .Label = " 2015-11-01 ", class = "factor"),
ST2 = structure(1L, .Label = c(" 00:00:01", " 00:00:02",
" 00:00:12", " 00:00:14", " 00:00:27", " 00:00:29", " 01:13:25"
), class = "factor"), SMS2 = 551L, SD3 = structure(1L, .Label = c("",
" 2015-11-01 "), class = "factor"), ST3 = structure(1L, .Label = c("",
" 00:00:27", " 01:13:33"), class = "factor"), SMS3 = NA_integer_,
SD4 = structure(1L, .Label = c("", " 2015-11-01 "), class = "factor"),
ST4 = structure(1L, .Label = c("", " 01:13:44"), class = "factor"),
SMS4 = NA_integer_), .Names = c("SD1", "ST1", "SMS1", "SD2",
"ST2", "SMS2", "SD3", "ST3", "SMS3", "SD4", "ST4", "SMS4"), row.names = 1L, class = "data.frame")
SD1 ST1 SMS1 SD2 ST2 SMS2 SD3 ST3 SMS3 SD4 ST4 SMS4
01-11-2015 00:00:01 323 2015-11-01 00:00:01 551
01-11-2015 00:00:02 289 2015-11-01 00:00:02 618
01-11-2015 01:13:16 253 2015-11-01 01:13:25 511 2015-11-01 01:13:33 489 2015-11-01 01:13:44 870
01-11-2015 00:00:11 986 2015-11-01 00:00:12 602
01-11-2015 00:00:27 48 2015-11-01 00:00:27 391 2015-11-01 00:00:27 429
01-11-2015 00:00:13 750 2015-11-01 00:00:14 255
01-11-2015 00:00:28 773 2015-11-01 00:00:29 114
Ignoring type conversions (e.g. combining the date and time character columns into a single POSIXct "datetime" column), a possible solution could be:
# Read the data into a data.frame using whitespace as the separator.
# Important: disable factors and interpret empty strings as NA
data <- read.table(header=TRUE, fill=TRUE, stringsAsFactors=FALSE, na.strings="", text=
"SD1 ST1 SMS1 SD2 ST2 SMS2 SD3 ST3 SMS3 SD4 ST4 SMS4
01-11-2015 00:00:01 323 2015-11-01 00:00:01 551
01-11-2015 00:00:02 289 2015-11-01 00:00:02 618
01-11-2015 01:13:16 253 2015-11-01 01:13:25 511 2015-11-01 01:13:33 489 2015-11-01 01:13:44 870
01-11-2015 00:00:11 986 2015-11-01 00:00:12 602
01-11-2015 00:00:27 48 2015-11-01 00:00:27 391 2015-11-01 00:00:27 429
01-11-2015 00:00:13 750 2015-11-01 00:00:14 255
01-11-2015 00:00:28 773 2015-11-01 00:00:29 114"
)
# Just for debugging purposes...
data
str(data)
# Append the last available block of transaction event columns to the end
# ("ifelse" because the column holding the "last value" must be chosen row by row;
# rows with only one download get NA)
data$SD.End <- ifelse(!is.na(data$SD4),data$SD4,
ifelse(!is.na(data$SD3),data$SD3,
ifelse(!is.na(data$SD2),data$SD2, NA)))
data$ST.End <- ifelse(!is.na(data$ST4),data$ST4,
ifelse(!is.na(data$ST3),data$ST3,
ifelse(!is.na(data$ST2),data$ST2, NA)))
data$SMS.End <- ifelse(!is.na(data$SMS4),data$SMS4,
ifelse(!is.na(data$SMS3),data$SMS3,
ifelse(!is.na(data$SMS2),data$SMS2, NA)))
data
# Now prepare the output by "cutting" the wanted result into a new data.frame
result <- data[, c(1:3, 13:15)]
# show result
result
The result is:
> result
SD1 ST1 SMS1 SD.End ST.End SMS.End
1 01-11-2015 00:00:01 323 2015-11-01 00:00:01 551
2 01-11-2015 00:00:02 289 2015-11-01 00:00:02 618
3 01-11-2015 01:13:16 253 2015-11-01 01:13:44 870
4 01-11-2015 00:00:11 986 2015-11-01 00:00:12 602
5 01-11-2015 00:00:27 48 2015-11-01 00:00:27 429
6 01-11-2015 00:00:13 750 2015-11-01 00:00:14 255
7 01-11-2015 00:00:28 773 2015-11-01 00:00:29 114
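The core problem is to avoid loops while still working row by row to decide which column to copy the available data from. This has to be done in a vectorized way to avoid a performance penalty, which is why I used ifelse.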
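The nested ifelse calls grow with every additional download block. As a rough base R sketch of the same vectorized idea, the last filled block can also be picked with matrix indexing. The helper name last_block and the toy data are made up for illustration, and the sketch assumes three columns per download, NAs only at the end of a row, and at least one download per row:

```r
# Hypothetical helper (not part of the original answer): pick the last filled
# block of each row via matrix indexing instead of nested ifelse() calls.
last_block <- function(df, end.names, block.size = 3) {
  n.blocks <- ncol(df) %/% block.size
  # Positions of the first column of every block (SD1, SD2, ...)
  first.cols <- seq(1, by = block.size, length.out = n.blocks)
  # Number of filled blocks per row (assumes NAs only at the end of a row)
  filled <- rowSums(!is.na(df[, first.cols, drop = FALSE]))
  stopifnot(all(filled >= 1))
  # Two-column index matrix: (row, number of the last filled block)
  idx <- cbind(seq_len(nrow(df)), filled)
  # Handle the k-th column of all blocks at once so the column type is kept
  out <- lapply(seq_len(block.size), function(k) {
    m <- as.matrix(df[, first.cols + k - 1, drop = FALSE])
    m[idx]
  })
  as.data.frame(setNames(out, end.names), stringsAsFactors = FALSE)
}

# Toy data with the same layout as in the question (values shortened)
toy <- data.frame(
  SD1 = c("01-11-2015", "01-11-2015", "01-11-2015"),
  ST1 = c("00:00:01", "01:13:16", "00:00:11"),
  SMS1 = c(323L, 253L, 986L),
  SD2 = c("2015-11-01", "2015-11-01", NA),
  ST2 = c("00:00:01", "01:13:25", NA),
  SMS2 = c(551L, 511L, NA),
  SD3 = c(NA, "2015-11-01", NA),
  ST3 = c(NA, "01:13:33", NA),
  SMS3 = c(NA, 489L, NA),
  stringsAsFactors = FALSE
)

end <- last_block(toy, c("SD.End", "ST.End", "SMS.End"))
result <- cbind(toy[, 1:3], end)
result
```

Unlike the ifelse version, a row with only one download gets its own start block as the end block here instead of NA.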
A fast solution for an arbitrary number of transaction event columns can be implemented with data.table:
# Preconditions for this solution:
# 1. Three columns per transaction event (download): Date, time, milliseconds
# 2. The download columns are at the beginning of the data.frame
# 3. There are no gaps within the downloads of a row (in other words: NAs are always at the end)
# 4. Sufficient performance is only guaranteed if the number of columns is not too high (guess: several thousands)
# For efficiency I use a data.table instead of a data.frame
library(data.table)
# Read the data into a data.frame using whitespace as the separator.
# Important: disable factors and interpret empty strings as NA
data <- read.table(header=TRUE, fill=TRUE, stringsAsFactors=FALSE, na.strings="", text=
"SD1 ST1 SMS1 SD2 ST2 SMS2 SD3 ST3 SMS3 SD4 ST4 SMS4
01-11-2015 00:00:01 323 2015-11-01 00:00:01 551
01-11-2015 00:00:02 289 2015-11-01 00:00:02 618
01-11-2015 01:13:16 253 2015-11-01 01:13:25 511 2015-11-01 01:13:33 489 2015-11-01 01:13:44 870
01-11-2015 00:00:11 986 2015-11-01 00:00:12 602
01-11-2015 00:00:27 48 2015-11-01 00:00:27 391 2015-11-01 00:00:27 429
01-11-2015 00:00:13 750 2015-11-01 00:00:14 255
01-11-2015 00:00:28 773 2015-11-01 00:00:29 114"
)
# Convert the data.frame into a data.table for efficient performance (and better processing syntax)
setDT(data)
# Specify the max. number of downloads per transaction in the data.frame.
# Since each download has three columns (date + time + milliseconds), derive this value from "ncol".
# If you have additional data columns you must set this value manually
max.num.of.downloads <- ncol(data) / 3
# Calculate the number of empty cells per row and add this value as a new column
data[, num.NA.cells := rowSums(is.na(data[, 1:(max.num.of.downloads*3), with=FALSE]))]
# Rough validation that NAs are consistent (three NAs per missing download)
stopifnot( nrow(data[(num.NA.cells %% 3) != 0,]) == 0 )
# Add a column containing the number of downloads
data[, downloads.count := max.num.of.downloads - (num.NA.cells / 3)]
# Now the big magic: for each group of rows with the same download count, add the "transaction end" columns.
# Note:
# a) .SD is a data.table containing only the sub data (SD!) of the current group
# b) "with=FALSE" makes j select columns by position (or name) instead of being evaluated as an expression
# c) := is assignment by reference (creates new columns if they do not exist)
# d) The outer parens around the column names to be created ("SD.End") are required if you create or update more than one column at once with ":="
data[, (c("SD.End", "ST.End", "SMS.End")) := .SD[, seq((downloads.count - 1) * 3 + 1 , (downloads.count - 1) * 3 + 3), with=FALSE],
by=downloads.count]
# data[, .N, by=downloads.count] # just for debugging: Count the number of rows per downloads.count group
# "data" now contains everything you need, so you can just "cut out" the wanted columns:
data[, .(SD1, ST1, SMS1, SD.End, ST.End, SMS.End)]
The result is the same:
> data[, .(SD1, ST1, SMS1, SD.End, ST.End, SMS.End)]
SD1 ST1 SMS1 SD.End ST.End SMS.End
1: 01-11-2015 00:00:01 323 2015-11-01 00:00:01 551
2: 01-11-2015 00:00:02 289 2015-11-01 00:00:02 618
3: 01-11-2015 01:13:16 253 2015-11-01 01:13:44 870
4: 01-11-2015 00:00:11 986 2015-11-01 00:00:12 602
5: 01-11-2015 00:00:27 48 2015-11-01 00:00:27 429
6: 01-11-2015 00:00:13 750 2015-11-01 00:00:14 255
7: 01-11-2015 00:00:28 773 2015-11-01 00:00:29 114
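The type conversion that was skipped above could look roughly like the sketch below. The helper to_datetime is hypothetical, and the date formats are an assumption read off the sample data: SD1 looks like day-month-year, while the later date columns look like year-month-day. The time zone is set to UTC only to keep the example reproducible.

```r
# Hypothetical helper (not part of the original answer): combine the date,
# time and millisecond columns of one block into a single POSIXct value.
to_datetime <- function(date, time, ms, date.format) {
  stamp <- as.POSIXct(paste(trimws(date), trimws(time)),
                      format = paste(date.format, "%H:%M:%S"),
                      tz = "UTC")
  # POSIXct counts seconds, so the milliseconds are added as a fraction
  stamp + ms / 1000
}

# Values of row 3 of the result; the date formats are assumptions
start <- to_datetime("01-11-2015", "01:13:16", 253, "%d-%m-%Y")
end <- to_datetime("2015-11-01", "01:13:44", 870, "%Y-%m-%d")

# Duration of the transaction in seconds
duration <- as.numeric(difftime(end, start, units = "secs"))
duration
```

The function is vectorized, so whole columns such as data$SD.End, data$ST.End and data$SMS.End can be passed in the same way.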