Read delimited .txt file with multiple, interspersed headers in R

Question

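I am trying to open and clean a large oceanographic dataset in R, in which station information is interspersed as headers between blocks of observations: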

$
 2008    1  774  8 17  5 11  2   78.4952    6.0375 30  7    1.2 -999.0 -9 -9 -9 -9 4868.8 2017  0  7114
    2.0    6.0297   35.0199   34.4101    2.0 11111
    3.0    6.0279   35.0201   34.4091    3.0 11111
    4.0    6.0272   35.0203   34.4091    4.0 11111
    5.0    6.0273   35.0204   34.4097    4.9 11111
    6.0    6.0274   35.0205   34.4104    5.9 11111
$
 2008    1  777  8 17 12  7 25   78.4738    8.3510 27  6    4.1 -999.0  3  7  2  0 4903.8 1570  0  7114
    3.0    6.4129   34.5637   34.3541    3.0 11111
    4.0    6.4349   34.5748   34.3844    4.0 11111
    5.0    6.4803   34.5932   34.4426    4.9 11111
    6.0    6.4139   34.5624   34.3552    5.9 11111
    7.0    6.5079   34.6097   34.4834    6.9 11111

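Each $ is followed by one line of station data (e.g. year, ..., latitude, longitude, date, time), then several lines of observations sampled at that station (e.g. depth, temperature, salinity, etc.).

I would like to add the station data to the observations, so that each variable is a column and each observation is a row, like this: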

2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    2   6.0297  35.0199 34.4101 2   11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    3   6.0279  35.0201 34.4091 3   11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    4   6.0272  35.0203 34.4091 4   11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    5   6.0273  35.0204 34.4097 4.9 11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    6   6.0274  35.0205 34.4104 5.9 11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    3   6.4129  34.5637 34.3541 3   11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    4   6.4349  34.5748 34.3844 4   11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    5   6.4803  34.5932 34.4426 4.9 11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    6   6.4139  34.5624 34.3552 5.9 11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    7   6.5079  34.6097 34.4834 6.9 11111

Answer 1

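This solution is fairly involved and relies on familiarity with several Tidyverse libraries and features. I'm not sure how robust it is for your needs, but it does handle the sample you posted. That said, I think the general approach will serve you well: fold the data into blocks, write functions to parse the smaller blocks, then unfold the results.

The first part finds the "$" markers, groups the lines that follow each marker, and then nests each group of lines. That leaves a data frame with only a few rows, one per section.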

library(tidyverse)
txt_lns <- readLines("ocean-sample.txt") 

txt <- tibble(txt = txt_lns)

# Start by finding new sections, and nesting the data
nested_txt <- txt %>%
  mutate(row_number = row_number()) %>%
  mutate(new_section = str_detect(txt, "\\$")) %>%           # Mark new sections
  mutate(starting = ifelse(new_section, row_number, NA)) %>%  # Index with row num
  tidyr::fill(starting) %>%                                   # Fill index down
                                                              # where missing
  select(-new_section) %>%                                    # Clean up
  filter(!str_detect(txt, "\\$")) %>%                         # Drop the "$" marker lines
  nest(data = c(txt, row_number))                             # "Nest" the data

# Take a quick look
nested_txt
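
To check the grouping idea without the file on disk, here is a minimal sketch on a small inline vector. Note that it swaps in base R's cumsum() for the mark-and-fill step above, so it illustrates the same idea rather than the exact pipeline. The names demo_lines, is_marker, section, and blocks are made up for illustration, and the sample lines are shortened.

```r
# Hypothetical inline sample standing in for readLines("ocean-sample.txt")
demo_lines <- c(
  "$",
  " 2008 1 774 8 17",
  "    2.0 6.0297",
  "    3.0 6.0279",
  "$",
  " 2008 1 777 8 17",
  "    3.0 6.4129"
)

is_marker <- trimws(demo_lines) == "$"   # TRUE on the "$" separator lines
section   <- cumsum(is_marker)           # running count gives a section id per line

# Drop the markers and split the remaining lines by section id
blocks <- split(demo_lines[!is_marker], section[!is_marker])
lengths(blocks)                          # number of lines kept in each section
```

Each element of blocks then holds one header line followed by that station's observation lines, which is the same shape as one nested row in nested_txt.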

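Next, we need to be able to process those nested blocks. The routines here parse each block by identifying the header row and then separating the fields into their own data frames. The header row and the shorter observation rows get different logic.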

# Deal with the records within a section
parse_inner_block <- function(x, header_ind) {
  if (header_ind) {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>%
      # Separate the header row into 22 variables
      separate(txt, into = LETTERS[1:22], sep = "\\s+")
  } else {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>% 
      # Separate the lesser rows into 6 variables
      separate(txt, into = letters[1:6], sep = "\\s+")
  }
  return(df)
}

parse_outer_block <- function(x) {
  df <- x %>%
    # Determine if it's a header row with 22 variables or lesser row with 6
    mutate(leading_row = (row_number == min(row_number))) %>%
    # Fold by header row vs. not
    nest(data = c(txt, row_number)) %>%
    # Create data frames for both header and lesser rows
    mutate(processed = purrr::map2(data, leading_row, parse_inner_block)) %>%
    unnest(processed) %>%
    # Copy header row values to lesser rows
    tidyr::fill(A:V) %>%
    # Drop header row
    filter(!leading_row)
  return(df)
}

Then we can put it all together: start with the nested data, process each block, unnest the returned fields, and prepare the full output.

# Actually put all this together and generate an output dataframe
output <- nested_txt %>%
  mutate(proc_out = purrr::map(data, parse_outer_block)) %>%
  select(-data) %>%
  unnest(proc_out) %>%
  select(-starting, -leading_row, -data, -row_number)

output
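
One thing to be aware of: separate() leaves every column as character by default. A minimal sketch of converting to numeric types with base R's type.convert() follows. The demo_out data frame is a made-up stand-in for a few columns of output, so adapt it to the real object.

```r
# Hypothetical stand-in for a few columns of `output`; every column is character
demo_out <- data.frame(
  A = c("2008", "2008"),
  I = c("78.4952", "78.4952"),
  a = c("2.0", "3.0"),
  b = c("6.0297", "6.0279"),
  stringsAsFactors = FALSE
)

# type.convert() picks a suitable type per column; as.is = TRUE keeps any
# non-numeric columns as character instead of turning them into factors
demo_num <- type.convert(demo_out, as.is = TRUE)
str(demo_num)
```

Assuming output is an ordinary tibble or data frame, the same call should apply to it directly, e.g. type.convert(output, as.is = TRUE).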

Hope that helps. For similar problems, I'd also suggest looking at some purrr tutorials.

Answer 2

This one is simpler and relies only on base R. I assume you have already read the text file with x <- readLines(....):

start <- which(x == "$") + 1             # Find header indices
rows <- diff(c(start, length(x)+2)) - 2  # Find number of lines per group
# Function to read header and rows and cbind
getdata <- function(begin, end) {
    cbind(read.table(text=x[begin]), read.table(text=x[(begin+1):(begin+end)]))
}
dta.list <- lapply(1:(length(start)), function(i) getdata(start[i], rows[i]))
dta.df <- do.call(rbind, dta.list)

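This works for the two groups you included in your post. You will need to fix the column names, since V1 - V6 are repeated at the beginning and the end.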
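
One possible way to handle the duplicate names is to rename the columns inside the helper before binding. This is only a sketch: the stn_ and obs_ prefixes are placeholders I made up, and the inline x is a shortened, hypothetical stand-in for the real file, so substitute your actual variable names.

```r
# Hypothetical two-block sample standing in for x <- readLines(....)
x <- c(
  "$",
  " 2008 1 774 78.4952 6.0375",
  "    2.0 6.0297 35.0199",
  "    3.0 6.0279 35.0201",
  "$",
  " 2008 1 777 78.4738 8.3510",
  "    3.0 6.4129 34.5637"
)

start <- which(x == "$") + 1                 # header line of each block
rows  <- diff(c(start, length(x) + 2)) - 2   # observation lines per block

getdata <- function(begin, n) {
  header <- read.table(text = x[begin])
  obs    <- read.table(text = x[(begin + 1):(begin + n)])
  # Placeholder names, prefixed by origin so the two sets cannot collide
  names(header) <- paste0("stn_", seq_along(header))
  names(obs)    <- paste0("obs_", seq_along(obs))
  cbind(header, obs)   # the single header row is recycled across the obs rows
}

dta.df <- do.call(rbind, Map(getdata, start, rows))
dta.df
```

Renaming inside the helper keeps every block's columns identical, so the final rbind lines them up without further work.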
