在 R 中重组一个独特的(NYC MTA 旋转门)数据集
Reorganizing a unique (NYC MTA turnstile) dataset in R
我有一个外观独特的数据集(NYC MTA 十字转门数据),我需要以某种方式重新组织它以执行一些分析。我编写的代码可以运行但效率不高,因为它是一个非常大的数据集。我希望有人可以提出更好的方法。
相关数据集有 43 列。第 1-3 列是唯一标识符(即特定车站的旋转门)。然后第 4-8 列标识计量时间、计量类型、进入和退出。 9-13,然后是其余的列,直到 43 列都遵循相同的模式。数据集很难看,所以我不想在这里 post,但你可以在下面 link 中找到它。您将不得不查看 2014 年 10 月 18 日之前的数据。
http://web.mta.info/developers/turnstile.html
#Vector of column numbers that identifies the break
a <- c(4, 9, 14, 19, 24, 29, 34, 39)
#The actual loop to re-sort the data
for (i in 1:nrow(data)) {
for (j in 1:length(a)) {
if (j == 8 ){ all <- rbind(all, cbind(data[i, 1:3], data[i, a[j]:43])) }
else { all <- rbind(all, cbind(data[i, 1:3], data[i,a[j]:(a[j+1]-1)])) } } }
所有这一切的最终结果是这样的。
1 2 3 1 2 3 4 5
5083 H026 R137 00-00-00 10-04-14 00:00:00 REGULAR 4072851 10491832
50831 H026 R137 00-00-00 10-04-14 04:00:00 REGULAR 4072918 10492356
50832 H026 R137 00-00-00 10-04-14 08:00:00 REGULAR 4073125 10492613
50833 H026 R137 00-00-00 10-04-14 12:00:00 REGULAR 4073511 10493116
50834 H026 R137 00-00-00 10-04-14 16:00:00 REGULAR 4073820 10493877
50835 H026 R137 00-00-00 10-04-14 20:00:00 REGULAR 4074140 10494817
它有效,但我知道有更有效的方法可以做到这一点。任何帮助将不胜感激!
编辑:
我应该对此添加更多内容,因为我遗漏了一些可能会改变此方法的关键部分。在我用 read.csv 读入数据后,我只用几米(第 2 列)对数据进行了子集化。因为我喜欢这个建议,所以我将子集数据转换为如下所示的字符串。这实际上表现相当不错,但任何进一步的建议将不胜感激!
out1 <- function() {
data <- read.csv(name, header=FALSE)
##Isolate data for stations included in network area
station <- subset(data, V2%in% station_names)
data <- apply(station, 1, paste, collapse=",")
starts <- seq(from=4, to=43, by=5)
new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
rbindlist(lapply(starts, function(y) {
as.list(x[c(1:3, y:(y+4))])
}))
}))
setnames(new_data, colnames(new_data), c("C.A", "UNIT", "SCP", "DATE", "TIME","DESC", "ENTRIES", "EXIT"))
new_data <- as.data.frame(new_data)
}
如果您不介意在数据加载时进行处理:
# data via http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description_pre-10-18-2014.txt
data <- readLines(textConnection("A002,R051,02-00-00,03-21-10,00:00:00,REGULAR,002670738,000917107,03-21-10,04:00:00,REGULAR,002670738,000917107,03-21-10,08:00:00,REGULAR,002670746,000917117,03-21-10,12:00:00,REGULAR,002670790,000917166,03-21-10,16:00:00,REGULAR,002670932,000917204,03-21-10,20:00:00,REGULAR,002671164,000917230,03-22-10,00:00:00,REGULAR,002671181,000917231,03-22-10,04:00:00,REGULAR,002671181,000917231
A002,R051,02-00-00,03-22-10,08:00:00,REGULAR,002671220,000917324,03-22-10,12:00:00,REGULAR,002671364,000917640,03-22-10,16:00:00,REGULAR,002671651,000917719,03-22-10,20:00:00,REGULAR,002672430,000917789,03-23-10,00:00:00,REGULAR,002672473,000917795,03-23-10,04:00:00,REGULAR,002672474,000917795,03-23-10,08:00:00,REGULAR,002672516,000917876,03-23-10,12:00:00,REGULAR,002672652,000917934
A002,R051,02-00-00,03-23-10,16:00:00,REGULAR,002672879,000917996,03-23-10,20:00:00,REGULAR,002673636,000918073,03-24-10,00:00:00,REGULAR,002673683,000918079,03-24-10,04:00:00,REGULAR,002673683,000918079,03-24-10,08:00:00,REGULAR,002673722,000918171,03-24-10,12:00:00,REGULAR,002673876,000918514,03-24-10,16:00:00,REGULAR,002674221,000918594,03-24-10,20:00:00,REGULAR,002675082,000918671
A002,R051,02-00-00,03-25-10,00:00:00,REGULAR,002675153,000918675,03-25-10,04:00:00,REGULAR,002675153,000918675,03-25-10,08:00:00,REGULAR,002675190,000918752,03-25-10,12:00:00,REGULAR,002675345,000919053,03-25-10,16:00:00,REGULAR,002675676,000919118,03-25-10,20:00:00,REGULAR,002676557,000919179,03-26-10,00:00:00,REGULAR,002676688,000919207,03-26-10,04:00:00,REGULAR,002676694,000919208
A002,R051,02-00-00,03-26-10,08:00:00,REGULAR,002676735,000919287,03-26-10,12:00:00,REGULAR,002676887,000919607,03-26-10,16:00:00,REGULAR,002677213,000919680,03-26-10,20:00:00,REGULAR,002678039,000919743,03-27-10,00:00:00,REGULAR,002678144,000919756,03-27-10,04:00:00,REGULAR,002678145,000919756,03-27-10,08:00:00,REGULAR,002678155,000919777,03-27-10,12:00:00,REGULAR,002678247,000919859
A002,R051,02-00-00,03-27-10,16:00:00,REGULAR,002678531,000919908,03-27-10,20:00:00,REGULAR,002678892,000919964,03-28-10,00:00:00,REGULAR,002678929,000919966,03-28-10,04:00:00,REGULAR,002678929,000919966,03-28-10,08:00:00,REGULAR,002678935,000919982,03-28-10,12:00:00,REGULAR,002679003,000920006,03-28-10,16:00:00,REGULAR,002679231,000920059,03-28-10,20:00:00,REGULAR,002679475,000920098"))
library(data.table)
starts <- seq(from=4, to=43, by=5)
new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
rbindlist(lapply(starts, function(y) {
as.list(x[c(1:3, y:(y+4))])
}))
}))
setnames(new_data, colnames(new_data), c("control_area", "unit", "scp", "date", "time", "description", "entries", "exits"))
dplyr::glimpse(new_data)
## Observations: 48
## Variables:
## $ control_area (fctr) A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A0...
## $ unit (fctr) R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R0...
## $ scp (fctr) 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, ...
## $ date (fctr) 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-22-10, 03-22-10, ...
## $ time (fctr) 00:00:00, 04:00:00, 08:00:00, 12:00:00, 16:00:00, 20:00:00, 00:00:00, 04:00:00, ...
## $ description (fctr) REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR,...
## $ entries (fctr) 002670738, 002670738, 002670746, 002670790, 002670932, 002671164, 002671181, 002...
## $ exits (fctr) 000917107, 000917107, 000917117, 000917166, 000917204, 000917230, 000917231, 000...
这里有另一种可供考虑的方法。它使用 "stringi" 包和我的 "splitstackshape" 包。
library(splitstackshape)
library(stringi)
我们将使用 the URL shared by @hrbmstr
中确定的字段描述中的名称
Names <- scan(what = "character", sep = ",",
text = paste0(
"C/A,UNIT,SCP,DATE1,TIME1,DESC1,ENTRIES1,EXITS1,",
"DATE2,TIME2,DESC2,ENTRIES2,EXITS2,DATE3,TIME3,DESC3,",
"ENTRIES3,EXITS3,DATE4,TIME4,DESC4,ENTRIES4,EXITS4,",
"DATE5,TIME5,DESC5,ENTRIES5,EXITS5,DATE6,TIME6,DESC6,",
"ENTRIES6,EXITS6,DATE7,TIME7,DESC7,ENTRIES7,EXITS7,",
"DATE8,TIME8,DESC8,ENTRIES8,EXITS8"))
## What are the unique variable "stubs"?
isRepeated <- unique(gsub("\d", "", Names[4:length(Names)]))
接下来,我们编写一个使用上述常量的函数。该函数执行以下操作:
- 使用
stri_split_fixed
将 vector
拆分为 matrix
。
- 修剪任何多余的空格。
- 将
matrix
转换为 data.table
并分配相关名称。
- 使用
merged.stack
从 "wide" 形式变为 "semi-long" 形式。
函数如下:
funAM <- function(invec) {
temp <- stri_split_fixed(invec, ",", simplify = TRUE)
temp <- `dim<-`(stri_trim_both(temp), dim(temp))
DT <- setnames(as.data.table(temp), Names)
merged.stack(getanID(DT, 1:3), var.stubs = isRepeated,
sep = "var.stubs")
}
让我们试试看:
## Try a dataset where we know there are unbalanced numbers of observations...
data <- readLines("http://web.mta.info/developers/data/nyct/turnstile/turnstile_130615.txt")
我们将在刚刚创建的 data
对象上应用函数:
system.time(out <- funAM(data)) ## Reasonably fast
# user system elapsed
# 1.25 0.02 1.29
out
# C/A UNIT SCP .id .time_1 DATE TIME DESC ENTRIES EXITS
# 1: A002 R051 02-00-00 1 1 06-08-13 00:00:00 REGULAR 004153504 001427135
# 2: A002 R051 02-00-00 1 2 06-08-13 04:00:00 REGULAR 004153535 001427138
# 3: A002 R051 02-00-00 1 3 06-08-13 08:00:00 REGULAR 004153559 001427177
# 4: A002 R051 02-00-00 1 4 06-08-13 12:00:00 REGULAR 004153683 001427255
# 5: A002 R051 02-00-00 1 5 06-08-13 16:00:00 REGULAR 004153959 001427320
# ---
# 241492: TRAM2 R469 00-05-01 6 4
# 241493: TRAM2 R469 00-05-01 6 5
# 241494: TRAM2 R469 00-05-01 6 6
# 241495: TRAM2 R469 00-05-01 6 7
# 241496: TRAM2 R469 00-05-01 6 8
与@hrbmstr 的方法相比,时间安排如下:
funHRB <- function() {
starts <- seq(from=4, to=43, by=5)
new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
rbindlist(lapply(starts, function(y) {
as.list(x[c(1:3, y:(y+4))])
}))
}))
setnames(new_data, colnames(new_data),
c("control_area", "unit", "scp", "date",
"time", "description", "entries", "exits"))
new_data
}
system.time(out2 <- funHRB())
# user system elapsed
# 23.59 0.03 23.77
同样通过比较,这两种方法都比我假设的 OP 的第一步要快得多,我假设是使用 read.csv
或类似的方法首先将数据导入 R地方。对我来说,使用相同的数据集大约需要一分钟:
system.time(DF <- read.csv(
header = FALSE, col.names = Names,
strip.white = TRUE,
colClasses = rep("character", length(Names)),
text = data))
# user system elapsed
# 66.01 0.07 66.91
我有一个外观独特的数据集(NYC MTA 十字转门数据),我需要以某种方式重新组织它以执行一些分析。我编写的代码可以运行但效率不高,因为它是一个非常大的数据集。我希望有人可以提出更好的方法。
相关数据集有 43 列。第 1-3 列是唯一标识符(即特定车站的旋转门)。然后第 4-8 列标识计量时间、计量类型、进入和退出。 9-13,然后是其余的列,直到 43 列都遵循相同的模式。数据集很难看,所以我不想在这里 post,但你可以在下面 link 中找到它。您将不得不查看 2014 年 10 月 18 日之前的数据。
http://web.mta.info/developers/turnstile.html
#Vector of column numbers that identifies the break
a <- c(4, 9, 14, 19, 24, 29, 34, 39)
#The actual loop to re-sort the data
for (i in 1:nrow(data)) {
for (j in 1:length(a)) {
if (j == 8 ){ all <- rbind(all, cbind(data[i, 1:3], data[i, a[j]:43])) }
else { all <- rbind(all, cbind(data[i, 1:3], data[i,a[j]:(a[j+1]-1)])) } } }
所有这一切的最终结果是这样的。
1 2 3 1 2 3 4 5
5083 H026 R137 00-00-00 10-04-14 00:00:00 REGULAR 4072851 10491832
50831 H026 R137 00-00-00 10-04-14 04:00:00 REGULAR 4072918 10492356
50832 H026 R137 00-00-00 10-04-14 08:00:00 REGULAR 4073125 10492613
50833 H026 R137 00-00-00 10-04-14 12:00:00 REGULAR 4073511 10493116
50834 H026 R137 00-00-00 10-04-14 16:00:00 REGULAR 4073820 10493877
50835 H026 R137 00-00-00 10-04-14 20:00:00 REGULAR 4074140 10494817
它有效,但我知道有更有效的方法可以做到这一点。任何帮助将不胜感激!
编辑:
我应该对此添加更多内容,因为我遗漏了一些可能会改变此方法的关键部分。在我用 read.csv 读入数据后,我只用几米(第 2 列)对数据进行了子集化。因为我喜欢这个建议,所以我将子集数据转换为如下所示的字符串。这实际上表现相当不错,但任何进一步的建议将不胜感激!
out1 <- function() {
data <- read.csv(name, header=FALSE)
##Isolate data for stations included in network area
station <- subset(data, V2%in% station_names)
data <- apply(station, 1, paste, collapse=",")
starts <- seq(from=4, to=43, by=5)
new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
rbindlist(lapply(starts, function(y) {
as.list(x[c(1:3, y:(y+4))])
}))
}))
setnames(new_data, colnames(new_data), c("C.A", "UNIT", "SCP", "DATE", "TIME","DESC", "ENTRIES", "EXIT"))
new_data <- as.data.frame(new_data)
}
如果您不介意在数据加载时进行处理:
# data via http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description_pre-10-18-2014.txt
data <- readLines(textConnection("A002,R051,02-00-00,03-21-10,00:00:00,REGULAR,002670738,000917107,03-21-10,04:00:00,REGULAR,002670738,000917107,03-21-10,08:00:00,REGULAR,002670746,000917117,03-21-10,12:00:00,REGULAR,002670790,000917166,03-21-10,16:00:00,REGULAR,002670932,000917204,03-21-10,20:00:00,REGULAR,002671164,000917230,03-22-10,00:00:00,REGULAR,002671181,000917231,03-22-10,04:00:00,REGULAR,002671181,000917231
A002,R051,02-00-00,03-22-10,08:00:00,REGULAR,002671220,000917324,03-22-10,12:00:00,REGULAR,002671364,000917640,03-22-10,16:00:00,REGULAR,002671651,000917719,03-22-10,20:00:00,REGULAR,002672430,000917789,03-23-10,00:00:00,REGULAR,002672473,000917795,03-23-10,04:00:00,REGULAR,002672474,000917795,03-23-10,08:00:00,REGULAR,002672516,000917876,03-23-10,12:00:00,REGULAR,002672652,000917934
A002,R051,02-00-00,03-23-10,16:00:00,REGULAR,002672879,000917996,03-23-10,20:00:00,REGULAR,002673636,000918073,03-24-10,00:00:00,REGULAR,002673683,000918079,03-24-10,04:00:00,REGULAR,002673683,000918079,03-24-10,08:00:00,REGULAR,002673722,000918171,03-24-10,12:00:00,REGULAR,002673876,000918514,03-24-10,16:00:00,REGULAR,002674221,000918594,03-24-10,20:00:00,REGULAR,002675082,000918671
A002,R051,02-00-00,03-25-10,00:00:00,REGULAR,002675153,000918675,03-25-10,04:00:00,REGULAR,002675153,000918675,03-25-10,08:00:00,REGULAR,002675190,000918752,03-25-10,12:00:00,REGULAR,002675345,000919053,03-25-10,16:00:00,REGULAR,002675676,000919118,03-25-10,20:00:00,REGULAR,002676557,000919179,03-26-10,00:00:00,REGULAR,002676688,000919207,03-26-10,04:00:00,REGULAR,002676694,000919208
A002,R051,02-00-00,03-26-10,08:00:00,REGULAR,002676735,000919287,03-26-10,12:00:00,REGULAR,002676887,000919607,03-26-10,16:00:00,REGULAR,002677213,000919680,03-26-10,20:00:00,REGULAR,002678039,000919743,03-27-10,00:00:00,REGULAR,002678144,000919756,03-27-10,04:00:00,REGULAR,002678145,000919756,03-27-10,08:00:00,REGULAR,002678155,000919777,03-27-10,12:00:00,REGULAR,002678247,000919859
A002,R051,02-00-00,03-27-10,16:00:00,REGULAR,002678531,000919908,03-27-10,20:00:00,REGULAR,002678892,000919964,03-28-10,00:00:00,REGULAR,002678929,000919966,03-28-10,04:00:00,REGULAR,002678929,000919966,03-28-10,08:00:00,REGULAR,002678935,000919982,03-28-10,12:00:00,REGULAR,002679003,000920006,03-28-10,16:00:00,REGULAR,002679231,000920059,03-28-10,20:00:00,REGULAR,002679475,000920098"))
library(data.table)
starts <- seq(from=4, to=43, by=5)
new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
rbindlist(lapply(starts, function(y) {
as.list(x[c(1:3, y:(y+4))])
}))
}))
setnames(new_data, colnames(new_data), c("control_area", "unit", "scp", "date", "time", "description", "entries", "exits"))
dplyr::glimpse(new_data)
## Observations: 48
## Variables:
## $ control_area (fctr) A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A002, A0...
## $ unit (fctr) R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R051, R0...
## $ scp (fctr) 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, 02-00-00, ...
## $ date (fctr) 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-21-10, 03-22-10, 03-22-10, ...
## $ time (fctr) 00:00:00, 04:00:00, 08:00:00, 12:00:00, 16:00:00, 20:00:00, 00:00:00, 04:00:00, ...
## $ description (fctr) REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR, REGULAR,...
## $ entries (fctr) 002670738, 002670738, 002670746, 002670790, 002670932, 002671164, 002671181, 002...
## $ exits (fctr) 000917107, 000917107, 000917117, 000917166, 000917204, 000917230, 000917231, 000...
这里有另一种可供考虑的方法。它使用 "stringi" 包和我的 "splitstackshape" 包。
library(splitstackshape)
library(stringi)
我们将使用 the URL shared by @hrbmstr
中确定的字段描述中的名称Names <- scan(what = "character", sep = ",",
text = paste0(
"C/A,UNIT,SCP,DATE1,TIME1,DESC1,ENTRIES1,EXITS1,",
"DATE2,TIME2,DESC2,ENTRIES2,EXITS2,DATE3,TIME3,DESC3,",
"ENTRIES3,EXITS3,DATE4,TIME4,DESC4,ENTRIES4,EXITS4,",
"DATE5,TIME5,DESC5,ENTRIES5,EXITS5,DATE6,TIME6,DESC6,",
"ENTRIES6,EXITS6,DATE7,TIME7,DESC7,ENTRIES7,EXITS7,",
"DATE8,TIME8,DESC8,ENTRIES8,EXITS8"))
## What are the unique variable "stubs"?
isRepeated <- unique(gsub("\d", "", Names[4:length(Names)]))
接下来,我们编写一个使用上述常量的函数。该函数执行以下操作:
- 使用
stri_split_fixed
将vector
拆分为matrix
。 - 修剪任何多余的空格。
- 将
matrix
转换为data.table
并分配相关名称。 - 使用
merged.stack
从 "wide" 形式变为 "semi-long" 形式。
函数如下:
funAM <- function(invec) {
temp <- stri_split_fixed(invec, ",", simplify = TRUE)
temp <- `dim<-`(stri_trim_both(temp), dim(temp))
DT <- setnames(as.data.table(temp), Names)
merged.stack(getanID(DT, 1:3), var.stubs = isRepeated,
sep = "var.stubs")
}
让我们试试看:
## Try a dataset where we know there are unbalanced numbers of observations...
data <- readLines("http://web.mta.info/developers/data/nyct/turnstile/turnstile_130615.txt")
我们将在刚刚创建的 data
对象上应用函数:
system.time(out <- funAM(data)) ## Reasonably fast
# user system elapsed
# 1.25 0.02 1.29
out
# C/A UNIT SCP .id .time_1 DATE TIME DESC ENTRIES EXITS
# 1: A002 R051 02-00-00 1 1 06-08-13 00:00:00 REGULAR 004153504 001427135
# 2: A002 R051 02-00-00 1 2 06-08-13 04:00:00 REGULAR 004153535 001427138
# 3: A002 R051 02-00-00 1 3 06-08-13 08:00:00 REGULAR 004153559 001427177
# 4: A002 R051 02-00-00 1 4 06-08-13 12:00:00 REGULAR 004153683 001427255
# 5: A002 R051 02-00-00 1 5 06-08-13 16:00:00 REGULAR 004153959 001427320
# ---
# 241492: TRAM2 R469 00-05-01 6 4
# 241493: TRAM2 R469 00-05-01 6 5
# 241494: TRAM2 R469 00-05-01 6 6
# 241495: TRAM2 R469 00-05-01 6 7
# 241496: TRAM2 R469 00-05-01 6 8
与@hrbmstr 的方法相比,时间安排如下:
funHRB <- function() {
starts <- seq(from=4, to=43, by=5)
new_data <- rbindlist(lapply(strsplit(data, ","), function(x) {
rbindlist(lapply(starts, function(y) {
as.list(x[c(1:3, y:(y+4))])
}))
}))
setnames(new_data, colnames(new_data),
c("control_area", "unit", "scp", "date",
"time", "description", "entries", "exits"))
new_data
}
system.time(out2 <- funHRB())
# user system elapsed
# 23.59 0.03 23.77
同样通过比较,这两种方法都比我假设的 OP 的第一步要快得多,我假设是使用 read.csv
或类似的方法首先将数据导入 R地方。对我来说,使用相同的数据集大约需要一分钟:
system.time(DF <- read.csv(
header = FALSE, col.names = Names,
strip.white = TRUE,
colClasses = rep("character", length(Names)),
text = data))
# user system elapsed
# 66.01 0.07 66.91