Speed up R script looping through files/folders to check thresholds, calculate averages, and plot
I am trying to speed up some code in R. I think my looping approach can be replaced (maybe with some form of lapply or by using sqldf), but I can't seem to figure out how.
The basic premise is that I have a parent directory with ~50 subdirectories, and each of those subdirectories contains ~200 CSV files (10,000 CSVs in total). Each of those CSV files contains ~86,400 rows (daily data recorded every second).
The goal of the script is to calculate the mean and standard deviation for two time intervals in each file, and then make one summary plot per subdirectory, like this:
library(timeSeries)
library(ggplot2)

# list subdirectories in parent directory
dir <- list.dirs(path = "/ParentDirectory", full.names = TRUE, recursive = FALSE)
num <- length(dir)

# iterate through all subdirectories
for (idx in 1:num){
  # declare empty vectors to fill for each subdirectory
  DayVal <- c()
  DayStd <- c()
  NightVal <- c()
  NightStd <- c()
  date <- as.Date(character())

  setwd(dir[idx])
  filenames <- list.files(path = getwd())
  numfiles <- length(filenames)

  # for each file in the subdirectory
  for (i in c(1:numfiles)){
    day <- read.csv(filenames[i], sep = ',')
    today <- as.Date(day$time[1], "%Y-%m-%d")

    # set the interval for the times of day we care about <- SQL seems like it may be
    # useful here, but I couldn't get read.csv.sql to recognize hourly intervals
    nightThreshold <- as.POSIXct(paste(today, "03:00:00"))
    dayThreshold <- as.POSIXct(paste(today, "15:00:00"))
    nightInt <- day[(as.POSIXct(day$time) >= nightThreshold & as.POSIXct(day$time) <= (nightThreshold + 3600)), ]
    dayInt <- day[(as.POSIXct(day$time) >= dayThreshold & as.POSIXct(day$time) <= (dayThreshold + 3600)), ]

    # check some thresholds in the data for that time period
    if (sum(nightInt$val, na.rm = TRUE) < 5){
      NightMean <- mean(nightInt$val, na.rm = TRUE)
      NightSD <- sd(nightInt$val, na.rm = TRUE)
    } else {
      NightMean <- NA
      NightSD <- NA
    }
    if (sum(dayInt$val, na.rm = TRUE) > 5){
      DayMean <- mean(dayInt$val, na.rm = TRUE)
      DaySD <- sd(dayInt$val, na.rm = TRUE)
    } else {
      DayMean <- NA
      DaySD <- NA
    }

    NightVal <- c(NightVal, NightMean)
    NightStd <- c(NightStd, NightSD)
    DayVal <- c(DayVal, DayMean)
    DayStd <- c(DayStd, DaySD)
    date <- c(date, as.Date(today))
  }

  df <- data.frame(date, DayVal, DayStd, NightVal, NightStd)

  # plot for the subdirectory
  p1 <- ggplot() +
    geom_point(data = df, aes(x = date, y = DayVal, color = "Day Average")) +
    geom_point(data = df, aes(x = date, y = DayStd, color = "Day Standard Dev")) +
    geom_point(data = df, aes(x = date, y = NightVal, color = "Night Average")) +
    geom_point(data = df, aes(x = date, y = NightStd, color = "Night Standard Dev")) +
    scale_colour_manual(values = c("steelblue", "turquoise3", "purple3", "violet"))
}
Thanks so much for any advice you can provide!
Consider an SQL database solution, since you are managing quite a bit of data across flat files. A relational database management system (RDBMS) easily handles millions of records and can even run the aggregation as needed with its scalable database engine, instead of processing everything in memory the way R does. If not for speed and efficiency, a database also buys you security, robustness, and organization as the central repository. You could even script the import of each daily CSV directly into the database.
Fortunately, practically all RDBMSs ship with CSV handlers and can bulk-load multiple files. Below are open-source solutions: SQLite (a file-level database), then MySQL and PostgreSQL (both server-level databases), all of which have corresponding libraries in R. Each example iterates through the directory list and imports the CSV files into a database table named timeseriesdata (with the same named fields/data types as the CSV files). At the end is one SQL call to pull the aggregation of night and day interval means and standard deviations (adjust per your needs). The only challenge is designating a file and subdirectory indicator (which may or may not exist in your actual data) and appending it to the CSV data (possibly by running an update query against the FileID column after each load, as sketched inside the MySQL loop below).
dir <- list.dirs(path = "/ParentDirectory",
full.names = TRUE, recursive = FALSE)
# SQLITE DATABASE
library(RSQLite)
sqconn <- dbConnect(RSQLite::SQLite(), dbname = "/path/to/database.db")
# (CONNECTION NOT NEEDED FOR THE CMD LINE LOAD BELOW, BUT USED BY THE dbWriteTable VARIANT)
for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT VIA COMMAND LINE OR BASH (ASSUMES sqlite3 IS ON THE PATH;
    # the simple quoting below assumes file paths without spaces)
    cmd <- paste0("(echo .separator ,; echo .import ", csvfile, " timeseriesdata)",
                  " | sqlite3 /path/to/database.db")
    system(cmd)
  }
}
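Alternatively, since a connection is already open, a pure-R variant (a sketch; RSQLite's dbWriteTable with append = TRUE avoids the shell entirely, at the cost of reading each CSV into R first):
for (d in dir){
  for (f in list.files(d)){
    # read the CSV in R and append it to the existing table
    dbWriteTable(sqconn, "timeseriesdata", read.csv(file.path(d, f)),
                 append = TRUE, row.names = FALSE)
  }
}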
# CLOSE CONNECTION
dbDisconnect(sqconn)
# MYSQL DATABASE
library(RMySQL)
myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
username="username", password="***")
for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT USING LOAD DATA INFILE COMMAND
    sql <- paste0("LOAD DATA INFILE '", csvfile, "'
                   INTO TABLE timeseriesdata
                   FIELDS TERMINATED BY ','
                   ENCLOSED BY '\"'
                   ESCAPED BY '\"'
                   LINES TERMINATED BY '\n'
                   IGNORE 1 LINES
                   (col1, col2, col3, col4, col5);")
    dbSendQuery(myconn, sql)
    dbCommit(myconn)
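    # (hypothetical sketch, not part of the original import) tag the rows just
    # loaded with a file indicator, assuming timeseriesdata has a nullable
    # FileID column that the raw CSVs leave empty
    dbSendQuery(myconn, paste0("UPDATE timeseriesdata SET FileID = '", f,
                               "' WHERE FileID IS NULL;"))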
  }
}
# CLOSE CONNECTION
dbDisconnect(myconn)
# POSTGRESQL DATABASE
library(RPostgreSQL)
pgconn <- dbConnect(PostgreSQL(), dbname="databasename", host="myhost",
user= "postgres", password="***")
for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT USING COPY COMMAND
    # (COPY reads the file on the database server, so the path must be
    # visible to the server; use \copy in psql for client-side files)
    sql <- paste0("COPY timeseriesdata(col1, col2, col3, col4, col5)
                   FROM '", csvfile, "' DELIMITER ',' CSV HEADER;")
    dbSendQuery(pgconn, sql)
  }
}
# CLOSE CONNECTION
dbDisconnect(pgconn)
# CREATE PLOT DATA FRAME (MYSQL EXAMPLE)
# (ADD INSIDE SUBDIRECTORY LOOP OR INCLUDE SUBDIR COLUMN IN GROUP BY)
library(RMySQL)
myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
username="username", password="***")
# AGGREGATE QUERY USING TWO DERIVED TABLE SUBQUERIES
# (FOR NIGHT AND DAY, ADJUST FILTERS PER NEEDS)
strSQL <- "SELECT ng.FileID, NightMean, NightSTD, DayMean, DaySTD
           FROM
              (SELECT nt.FileID, Avg(nt.val) As NightMean, STDDEV(nt.val) As NightSTD
               FROM timeseriesdata nt
               WHERE nt.time >= '03:00:00' AND nt.time <= '04:00:00'
               GROUP BY nt.FileID
               HAVING Sum(nt.val) < 5) AS ng
           INNER JOIN
              (SELECT dt.FileID, Avg(dt.val) As DayMean, STDDEV(dt.val) As DaySTD
               FROM timeseriesdata dt
               WHERE dt.time >= '15:00:00' AND dt.time <= '16:00:00'
               GROUP BY dt.FileID
               HAVING Sum(dt.val) > 5) AS dy
           ON ng.FileID = dy.FileID;"
res <- dbSendQuery(myconn, strSQL)
df <- dbFetch(res)
dbClearResult(res)
dbDisconnect(myconn)
One thing to do is convert day$time once, rather than converting it over and over as you do now. Also look at the lubridate package, which is much faster than as.POSIXct when you have a large number of conversions.
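A minimal sketch of that idea, assuming the time column is text like "2015-06-01 03:00:00" (ymd_hms is lubridate's vectorized parser):
library(lubridate)
day <- read.csv(filenames[i], sep = ',')
# parse the whole column once instead of calling as.POSIXct on it repeatedly
day$time <- ymd_hms(day$time)
today <- as.Date(day$time[1])
nightThreshold <- ymd_hms(paste(today, "03:00:00"))
dayThreshold <- ymd_hms(paste(today, "15:00:00"))
# the interval subsets now reuse the already-parsed column
nightInt <- day[day$time >= nightThreshold & day$time <= nightThreshold + 3600, ]
dayInt <- day[day$time >= dayThreshold & day$time <= dayThreshold + 3600, ]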
Also size the variables you store results in (e.g. DayVal, DayStd) up front to the proper length (DayVal <- numeric(numfiles)) and then assign each result into its index, instead of growing the vectors with c().
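For example (a sketch, keeping the names from the question):
DayVal <- numeric(numfiles)
DayStd <- numeric(numfiles)
NightVal <- numeric(numfiles)
NightStd <- numeric(numfiles)
date <- as.Date(rep(NA, numfiles))
for (i in seq_len(numfiles)){
  # ... read the file and compute DayMean, DaySD, NightMean, NightSD as before ...
  DayVal[i] <- DayMean    # indexed assignment, no reallocation each iteration
  DayStd[i] <- DaySD
  NightVal[i] <- NightMean
  NightStd[i] <- NightSD
  date[i] <- today
}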
If the CSV files are large, consider the fread function from the data.table package.
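A drop-in sketch (fread auto-detects the separator, and its select argument limits the read to the columns actually used):
library(data.table)
# typically many times faster than read.csv on ~86,400-row files
day <- fread(filenames[i], select = c("time", "val"))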