Is there a way to open a .csv in R while skipping the first X rows, where X varies depending on where the specified headers are found?
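I'm trying to read multiple .csv files from a folder and combine all the data into a single data frame for analysis and plotting. Normally, I would use this approach to load and combine all the files.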
file_list <- list.files(paste(WorkingDirectory, "/Transducer Data", sep= ""), pattern = "*.csv",
full.names = TRUE)
for (file in file_list){
all_transducer_file <- read.csv(file, header = F, as.is = T, sep= ",", skip = 15)
}
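However, I've run into a couple of problems.
The generated .csv files have a varying number of lines before the data. The headers for the data always seem to be: "Date and Time", "Seconds", "Pressure (PSI)", and "Surface Water Level (ft)". The number of lines varies with the number of errors the device has thrown since the last data pull.
The data sometimes loads as type "chr" and sometimes as type "factor". I don't really understand the difference between them or how they affect my code.
Is there a way to open a csv while skipping the first X rows, where X is based on where the specified headers are found?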
Since you know that Date and Time appears in the header, try this:
library(data.table)
fread(filename, skip = "Date and Time")
See ?fread for other arguments that you may or may not need.
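To see how skip = "Date and Time" behaves, here is a minimal sketch that writes a small sample file and reads it back. The error lines and data values are made up for illustration; only the "Date and Time" header text comes from the question.

```r
library(data.table)

# Hypothetical sample file: two device-error lines, then the real table
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Error: sensor timeout",
  "Error: low battery",
  "Date and Time,Seconds,Pressure (PSI)",
  "2020-01-01 00:00:00,0,14.7",
  "2020-01-01 00:00:30,30,14.8"
), tmp)

# With a string, fread searches the file for it and starts reading on that line,
# so the line containing "Date and Time" becomes the header row
dt <- fread(tmp, skip = "Date and Time")

names(dt)  # "Date and Time" "Seconds" "Pressure (PSI)"
nrow(dt)   # 2
```

Because the search is by content rather than by line number, the same call should work no matter how many error lines precede the table.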
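So here is one way to solve the problem at hand.
Problems and solutions:
- Don't know where skip should start -> use grep to find the line where the column names start
- Some columns become factor and some character -> use read_csv, or set stringsAsFactors = FALSE in read.csv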
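On the "chr" versus "factor" question: a character column stores the strings themselves, while a factor stores integer codes plus a set of levels. A small sketch with made-up data (the column names site and level are hypothetical) shows the difference. Note that since R 4.0.0, read.csv defaults to stringsAsFactors = FALSE, so the argument mainly matters on older versions.

```r
# Made-up CSV content, passed as text instead of a file
csv_text <- "site,level\nA,1.5\nB,2.0\nA,1.7"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$site)  # "character"
class(as_fct$site)  # "factor"

# A factor is integer codes plus the levels they point to
as.integer(as_fct$site)  # 1 2 1
levels(as_fct$site)      # "A" "B"
```

Reading everything as character keeps the column types consistent across files, which makes combining them more predictable.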
Getting the file names and the number of lines to skip
# Set the path of the folder that contains the csv data
# pattern is a regular expression, so "\\.csv$" matches names ending in .csv
file_list <-
  list.files(paste(WorkingDirectory, "/Transducer Data", sep = ""), pattern = "\\.csv$",
             full.names = TRUE)
# For each file, find the line at which the table we want starts
# sapply loops over every file in file_list
# grep("Date and Time", readr::read_lines(x))[1] -> reads the file's lines and returns the first line number that contains "Date and Time"
# Subtract one from that line number to get the number of lines to skip
skip_lines <-
  sapply(file_list, function(x){grep("Date and Time", readr::read_lines(x))[1] - 1},
         USE.NAMES = FALSE)
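The header lookup can be checked on a small file without readr. The sketch below uses base readLines instead; find_skip is a hypothetical helper name and the file contents are made up. It also stops with an error when the header is missing, since grep(...)[1] would otherwise return NA.

```r
# Hypothetical helper: number of lines that precede the header row
find_skip <- function(path, header = "Date and Time") {
  # fixed = TRUE treats the header as plain text rather than a regex
  hits <- grep(header, readLines(path), fixed = TRUE)
  if (length(hits) == 0) stop("header not found in ", path)
  hits[1] - 1
}

# Made-up file with three error lines before the table
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Error 1",
  "Error 2",
  "Error 3",
  "Date and Time,Seconds",
  "2020-01-01 00:00:00,0"
), tmp)

n_skip <- find_skip(tmp)
n_skip  # 3
```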
Reading the data
# Here I am using purrr to loop over the files, but you can use
# a normal loop or the apply family; the benefit of map_df (a function in purrr)
# is that it returns a single data frame without needing to bind the pieces yourself
library(purrr)
# Method one: using read.csv
seq_along(file_list) %>% # loop over the file indices
  map_df(function(x){
    # Read each file, skipping the number of lines stored in skip_lines
    # stringsAsFactors = FALSE -> keep text columns as character instead of converting them to factor
    read.csv(file_list[x], skip = skip_lines[x], stringsAsFactors = FALSE)
  })
# Method two: using read_csv
seq_along(file_list) %>%
  map_df(function(x){
    # cols() is called with the readr:: prefix because readr is not attached here
    readr::read_csv(file_list[x], skip = skip_lines[x], col_types = readr::cols())
  })
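As the comments mention, a normal loop or the apply family works as well. Here is a minimal base R sketch, assuming every file shares the same columns. The two sample files are made up, and read_from_header is a hypothetical helper that reads each file once and parses it from the header line onward.

```r
# Hypothetical helper: read one file starting at its header line
read_from_header <- function(path, header = "Date and Time") {
  lines <- readLines(path)
  start <- grep(header, lines, fixed = TRUE)[1]
  # check.names = FALSE keeps column names such as "Date and Time" unchanged
  read.csv(text = lines[start:length(lines)],
           stringsAsFactors = FALSE, check.names = FALSE)
}

# Two made-up files with different numbers of error lines
data_dir <- tempfile()
dir.create(data_dir)
writeLines(c("Error", "Date and Time,Seconds", "2020-01-01 00:00:00,0"),
           file.path(data_dir, "a.csv"))
writeLines(c("Error", "Error", "Date and Time,Seconds",
             "2020-01-02 00:00:00,0", "2020-01-02 00:00:30,30"),
           file.path(data_dir, "b.csv"))

files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# lapply reads every file; do.call(rbind, ...) stacks the results
combined <- do.call(rbind, lapply(files, read_from_header))
nrow(combined)  # 3
```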