Is there a way to open a .csv in R skipping the first X rows, where X varies depending on where the specified headers are found?

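I am trying to read multiple .csv files from a folder and combine all of the data into one data frame for analysis and plotting. Normally, I would use this approach to load and combine all of the files.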

    file_list <- list.files(paste(WorkingDirectory, "/Transducer Data", sep= ""), pattern = "*.csv", 
    full.names = TRUE)

    for (file in file_list){
       all_transducer_file <- read.csv(file, header = F, as.is = T, sep= ",", skip = 15) 
     }

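However, I have run into a couple of problems.

  1. The generated .csv files have a varying number of lines before the data. The headers for the data always seem to be: "Date and Time", "Seconds", "Pressure (PSI)", and "Surface Water Level (ft)". The number of lines varies with the number of errors the device has thrown since the last data pull.

  2. The data is sometimes loaded as type "chr" and sometimes as type "factor". I don't really understand the difference between the two or how it affects my code.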

Is there a way to open a csv skipping the first X rows, where X is based on where the specified headers can be found?

Since you know that Date and Time appears in the header, try this:

    library(data.table)
    fread(filename, skip = "Date and Time")

See ?fread for other arguments that you may or may not need.
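To see what `skip = "Date and Time"` does, here is a small self-contained sketch. It writes a throwaway file to a temporary path and reads it back. The junk lines and data values are invented for illustration and are not from your files.

```r
library(data.table)

# Build a small throwaway file: three junk lines, then the real header and data.
# All contents here are made up for the demonstration.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Device log",
  "Error: sensor timeout",
  "Error: low battery",
  "Date and Time,Seconds,Pressure (PSI)",
  "2020-01-01 00:00:00,0,14.7",
  "2020-01-01 00:00:30,30,14.8"
), tmp)

# skip = "<string>" makes fread start at the first line containing that string,
# so the number of junk lines above the header no longer matters
dt <- fread(tmp, skip = "Date and Time")

print(names(dt))
print(nrow(dt))
```

Because the search is a plain substring match on the file, it should keep working as long as the header text itself does not also appear in one of the error lines above it.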

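So here is a solution to the problem at hand.

Problems and solutions:

  1. Not knowing where skip should start -> use grep to find the line where the column names begin
  2. Some columns become factors and some characters -> use read_csv, or set stringsAsFactors = FALSE in read.csv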
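On the "chr" versus "factor" question, a short sketch of the difference may help. The vectors below are invented for illustration. Note also that since R 4.0.0 the default in read.csv is stringsAsFactors = FALSE, so this mostly matters on older versions of R.

```r
x_chr <- c("low", "high", "low")
x_fct <- factor(x_chr)

# A character vector stores the strings themselves
class(x_chr)        # "character"

# A factor stores integer codes plus a fixed set of levels
class(x_fct)        # "factor"
levels(x_fct)       # "high" "low"
as.integer(x_fct)   # 2 1 2

# The usual trap: numbers that were read in as a factor.
# as.numeric() on a factor returns the codes, not the values.
f <- factor(c("10.5", "3.2"))
codes  <- as.numeric(f)                 # 1 2
values <- as.numeric(as.character(f))   # 10.5 3.2
```

Reading everything as character and converting columns explicitly afterwards tends to avoid that trap.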

Getting the file names and the lines to skip

    # Build the list of csv files to read
    file_list <- 
      list.files(paste(WorkingDirectory, "/Transducer Data", sep = ""), pattern = "*.csv", 
                 full.names = TRUE)

    # Find the line at which the table we want starts in each file.
    # sapply loops over the files; for each one, grep() returns the line numbers
    # containing "Date and Time" and [1] keeps the first match.
    # Subtracting one gives the number of lines to skip before the header.
    skip_lines <- 
      sapply(file_list, function(x){grep("Date and Time", readr::read_lines(x))[1] - 1}, 
             USE.NAMES = FALSE)

Reading the data

    # purrr is used here to loop over the files, but a normal loop or the
    # apply family works too. The benefit of map_df (a purrr function) is that
    # it returns a single data frame without a separate bind step.
    library(purrr)

    # Method one, using read.csv
    seq_along(file_list) %>% # loop over the file indices
      map_df(function(x){
        # Read each file, skipping the number of lines stored in skip_lines.
        # stringsAsFactors = FALSE keeps text columns as character instead of factor.
        read.csv(file_list[x], skip = skip_lines[x], stringsAsFactors = FALSE)
      })

    # Method two, using read_csv
    # cols() is called as readr::cols() because readr is not attached here
    seq_along(file_list) %>%
      map_df(function(x){
        readr::read_csv(file_list[x], skip = skip_lines[x], col_types = readr::cols())
      })
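If you would rather not depend on readr or purrr, the same idea can be sketched in base R. This is only a rough sketch under some assumptions: `read_from_header` is a hypothetical helper name, and the two demo files are invented and written to a temporary folder so that the example runs on its own.

```r
# Hypothetical helper: read one file, starting at the line that holds the header
read_from_header <- function(path, header_text = "Date and Time") {
  lines <- readLines(path, warn = FALSE)
  # fixed = TRUE treats header_text as a literal string, not a regex
  header_line <- grep(header_text, lines, fixed = TRUE)[1]
  if (is.na(header_line)) stop("Header not found in ", path)
  read.csv(path, skip = header_line - 1, stringsAsFactors = FALSE,
           check.names = FALSE)
}

# Two throwaway files with a different number of junk lines (invented data)
demo_dir <- tempfile("transducer")
dir.create(demo_dir)
writeLines(c("Error 1",
             "Date and Time,Seconds",
             "2020-01-01 00:00:00,0"),
           file.path(demo_dir, "a.csv"))
writeLines(c("Error 1", "Error 2", "Error 3",
             "Date and Time,Seconds",
             "2020-01-02 00:00:00,0",
             "2020-01-02 00:00:30,30"),
           file.path(demo_dir, "b.csv"))

# Read every file and stack the results into one data frame
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
all_data <- do.call(rbind, lapply(files, read_from_header))

print(dim(all_data))
```

Stopping with an error when the header is missing is a design choice. Without that check, a file with no "Date and Time" line would pass NA as the skip value to read.csv.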