Read list of files with inconsistent delimiter/fixed width

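I am trying to find a more efficient way to import a list of data files that have a rather awkward structure. The files are generated by a software program, and they look as if they were meant to be printed and viewed rather than exported and used. Each file contains a list of "compounds", each followed by some associated data. After a line reading "Compound X: XXXX" come several lines of tab-delimited data. Within a file, the number of rows per compound is constant, but that number can differ from file to file.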

Here is some example data:

#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
    "\nPrinted Mon March 28 14:54:39 2022\n", 
    "\nCompound 1: One\n", 
    "\tName\tID\tResult", 
    "\n1\tA1234\tQC\t25.2", 
    "\n2\tA4567\tQC\t26.8\n", 
    "\nCompound 2: Two\n", 
    "\tName\tID\tResult", 
    "\n1\tA1234\tQC\t51.1", 
    "\n2\tA4567\tQC\t48.6\n",
    file = "test1.txt")
cat("Quantify Compound Summary Report\n",
    "\nPrinted Mon March 28 14:54:39 2022\n", 
    "\nCompound 1: One\n", 
    "\tName\tID\tResult", 
    "\n1\tC1234\tQC\t25.2", 
    "\n2\tC4567\tQC\t26.8", 
    "\n3\tC8910\tQC\t25.4\n", 
    "\nCompound 2: Two\n", 
    "\tName\tID\tResult", 
    "\n1\tC1234\tQC\t51.1", 
    "\n2\tC4567\tQC\t48.6",
    "\n3\tC8910\tQC\t45.6\n",
    file = "test2.txt")

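What I want in the end is a list of dataframes, one per "compound", each containing all the rows of data associated with that compound. To get there, I currently use a fairly convoluted set of cobbled-together functions, which gives me what I want but in a very unruly fashion.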

library(tidyverse)

## Step 1: ID list of data files
data.files <- list.files(path = ".",
                         pattern = "\\.txt$",  # regex: match the literal ".txt" extension only
                         full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4) 

## Step 3: Identify the "compounds" in the data file output  
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)

## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)

## Step 5: Curate the list of compounds - remove "Compound X: " 
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13))) 

## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre 
NameCols <- function(z) lapply(names(z), function(i){
  x <- z[[ i ]]
  colnames(x) <- x[2,]
  x[c(-1,-2),]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list) 

## Step 7: rbind the data based on the compound 
cmpd_names <- unique(unlist(sapply(data.list, names)))

result <- list()
for (n in cmpd_names) {
  result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)

list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))

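The challenges here are the script's efficiency in terms of run time (I may import hundreds or thousands of data files, each with hundreds of lines of data, which can take quite a while) and its general "cleanliness", which is why I included tidyverse as a tag. I would also like the solution to be highly generalizable, since the "compounds" can change over time. If someone can come up with a clean and efficient way to do all of this, I would be forever in your debt.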

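See one approach below. The whole pipeline might look intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of the data transformation. The gsub calls do some cleanup with regular expressions: modify them as needed.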

intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
    rowwise %>%
    ## read file content into a raw string:
    mutate(raw = read_file(file_name)) %>%
    ## drop the rowwise grouping explicitly so that the
    ## following steps operate on the whole table:
    ungroup() %>%
    ## separate raw file contents into rows 
    ## using newline and carriage return as row delimiters:
    separate_rows(raw, sep = '[\n\r]') %>%
    ## provide a compound column for later grouping
    ## by extracting the 'Compound' string from column raw
    ## or setting the compound column to NA otherwise:
    mutate(compound = ifelse(grepl('^Compound',raw),
                             gsub('.*(Compound .*):.*','\\1', raw),
                             NA)
           ) %>%
    ## remove rows with empty raw text:
    filter(raw != '') %>%
    ## filling missing compound values (NAs) with last non-NA compound string:
    fill(compound, .direction = 'down') %>%
    ## keep only rows with tab-separated raw string
    ## indicating tabular data
    filter(grepl('\t',raw)) %>%
    ## insert a column header 'Index' because
    ## original format has four data columns but only three header cols:
    mutate(raw = gsub(' *\tName','Index\tName',raw))
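
As a minimal sketch of the inspection step mentioned earlier, the snippet below runs only the first few stages on a small made-up file and prints the result with head(). The temporary file, its contents, and the name `stage` are assumptions for illustration, and the regular expression is a slightly stricter variant of the one used in the pipeline.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical miniature input file, written to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("Quantify Compound Summary Report",
             "",
             "Compound 1: One",
             "\tName\tID\tResult",
             "1\tA1234\tQC\t25.2",
             "2\tA4567\tQC\t26.8"),
           tmp)

stage <- tibble(file_name = tmp) %>%
    # read the whole file into one string
    mutate(raw = read_file(file_name)) %>%
    # one row per line of the file
    separate_rows(raw, sep = "[\n\r]") %>%
    # label the 'Compound' lines, leave everything else NA for now
    mutate(compound = ifelse(grepl("^Compound", raw),
                             gsub("(Compound [^:]*):.*", "\\1", raw),
                             NA))

# Peek at the current stage before adding further steps
head(stage, 3)
```

Moving the head() call further down the pipeline, one step at a time, shows how each transformation changes the table.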

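The steps above produce a dataframe whose column 'raw' contains the cleaned-up data as strings suitable for conversion into tabular data (tab-delimited, with line breaks). From there, we can either keep the future individual tables in a so-called list column inside the parent table (variant A), or split column 'raw' and map over the pieces (variant B, credit to @Dorton).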

Variant A produces a column of dataframes inside the dataframe:

intermediate_result %>%
   group_by(compound) %>%
    ## the nifty piece: you can store dataframes inside a dataframe:
    mutate(
        tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
    )
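
If a plain named list of tables is wanted from the list column, one possible follow-up is sketched below. It is an illustration only: `toy` is a hypothetical stand-in for `intermediate_result`, and it uses summarise() rather than mutate() so that each compound yields exactly one row (and therefore one table).

```r
library(dplyr)

# Hypothetical stand-in for `intermediate_result`: one cleaned-up,
# tab-delimited line per row, header line first within each compound
toy <- tibble(
    compound = rep(c("Compound 1", "Compound 2"), each = 3),
    raw = c("Index\tName\tID\tResult", "1\tA1234\tQC\t25.2", "2\tA4567\tQC\t26.8",
            "Index\tName\tID\tResult", "1\tA1234\tQC\t51.1", "2\tA4567\tQC\t48.6")
)

nested <- toy %>%
    group_by(compound) %>%
    # one row per compound, with the parsed table stored in a list column
    summarise(tables = list(read.table(text = raw, header = TRUE, sep = "\t")))

# Turn the list column into a list named by compound
tables_by_compound <- setNames(nested$tables, nested$compound)
tables_by_compound[["Compound 2"]]
```

Note that the toy data contains only one header line per compound; when several files are combined, the header line repeats once per file and would need to be removed before parsing.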

Variant B produces a list of dataframes, each named after the corresponding compound:

intermediate_result %>%
    split(f = as.factor(.$compound)) %>% 
    lapply(function(x) x %>%
                       separate(raw,
                                into = unlist(
                                    str_split(x$raw[1], pattern = "\t")),
                                ## split on tabs only; the default separator
                                ## would also split values such as 25.2:
                                sep = "\t"
                                )
           )
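
When several files are combined, each compound group contains one header line per file, and every column is still character. The sketch below shows one way this could be tidied up for a single group. The `toy` data and the names `col_names` and `cleaned` are assumptions for illustration, and base R's type.convert() is used here as one option for guessing column types.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows for one compound, as they might look after two
# files have been combined (the header line appears once per file)
toy <- tibble(
    compound = "Compound 1",
    raw = c("Index\tName\tID\tResult", "1\tA1234\tQC\t25.2",
            "Index\tName\tID\tResult", "1\tC1234\tQC\t25.4")
)

# Column names taken from the first (header) line
col_names <- strsplit(toy$raw[1], "\t", fixed = TRUE)[[1]]

cleaned <- toy %>%
    # split each line into columns on tabs
    separate(raw, into = col_names, sep = "\t") %>%
    # drop the header lines that were carried along as data rows
    filter(Index != "Index") %>%
    # convert character columns to integer/numeric where possible
    type.convert(as.is = TRUE)

cleaned
```

The same steps could be wrapped in a function and applied to each element of the list with lapply.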