Any way to reshape plain-text data into regular tabular data with dplyr utilities?

Question

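I have gridded data in plain-text ASCII format (see the original raw data). All observations are daily, and each year's data is collected on a grid. I want to restructure this data because I need annual statistics. To do that, I need to rebuild the plain-text data as matrix-like tabular data in which each day's observations sit in a new column, so that taking the annual mean becomes easier.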

Update:

Because the original raw plain-text data is fairly large, I only show an overview of it here.

Update 2:

I imported the raw ASCII data into R with the following script:

rawdata = read.table(file = "~/25_krig_all_1980", header = FALSE, fill = TRUE, comment.char="Y", stringsAsFactors=FALSE )
colnames(rawdata) = c("long", "lat", "precip", "err1", "err2")

The skeleton of the raw plain-text data, as seen in Notepad++, looks like this (see the clipped raw data):

1980   1   1   1
      6.125 47.375     0.0    20.00     1.0
      6.375 47.375     0.0    19.99     1.0
      6.625 47.375     0.0    19.97     1.0
      6.875 47.375     0.0    19.84     1.0
      7.125 47.375     0.0    20.00     1.0
1980   1   2   2
      6.125 47.375     1.5    20.00     1.0
      6.375 47.375     1.5    19.99     1.0
      6.625 47.375     1.5    19.97     1.0
      6.875 47.375     1.5    19.84     1.0
      7.125 47.375     2.9    20.00     1.0
1980   1   3   3
      6.125 47.375     3.3    20.00     1.0
      6.375 47.375     3.3    19.99     1.0
      6.625 47.375     3.3    19.97     1.0
      6.875 47.375     3.3    19.84     1.0
      7.125 47.375     1.3    20.00     1.0
1980   1   4   4
      6.125 47.375     3.8    20.00     1.0
      6.375 47.375     3.8    19.99     1.0
      6.625 47.375     3.8    19.97     1.0
      6.875 47.375     3.7    19.84     1.0
      7.125 47.375     3.7    20.00     1.0
1980   1   5   5
      6.125 47.375     2.2    20.00     1.0
      6.375 47.375     2.2    19.99     1.0
      6.625 47.375     2.2    19.97     1.0
      6.875 47.375     2.2    19.84     1.0
      7.125 47.375     4.8    20.00     1.0

Here is a minimal example I used to illustrate the raw plain-text data:

foo = read.table("grid_data_demo.txt", header=FALSE, skip=1, nrows = 5)
colnames(foo) = c("long", "lat", "precip", "err1", "err2")

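Update 3:

The raw plain-text data has no text delimiter, and nothing in the file groups the data into a list. I created miniDat as a reproducible example because I would like to obtain a list-like object from the raw data (see the example data snippet):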

miniDat = list(dat_19800101 = data.frame(long=c( 6.125 ,6.375, 6.625, 6.875, 7.125),
                                         lat=c(47.375, 47.375, 47.375, 47.375, 47.375),
                                         precip=c(0, 0, 0, 0, 0),
                                         err1=c(20.00, 19.99, 19.97, 19.84, 20.00),
                                         err2=c(1, 1, 1, 1, 1)),
               dat_19800102 = data.frame(long=c( 6.125 ,6.375, 6.625, 6.875, 7.125),
                                         lat=c(47.375, 47.375, 47.375, 47.375, 47.375),
                                         precip=c(1.5, 1.5, 1.5, 1.5, 2.9),
                                         err1=c(20.00, 19.99, 19.97, 19.84, 20.00),
                                         err2=c(1, 1, 1, 1, 1)),

               dat_19800103 = data.frame(long=c( 6.125 ,6.375, 6.625, 6.875, 7.125),
                                         lat=c(47.375, 47.375, 47.375, 47.375, 47.375),
                                         precip=c(3.3, 3.3, 3.3, 3.3, 1.3),
                                         err1=c(20.00, 19.99, 19.97, 19.84, 20.00),
                                         err2=c(1, 1, 1, 1, 1)),
               dat_19800104 = data.frame(long=c( 6.125 ,6.375, 6.625, 6.875, 7.125),
                                         lat=c(47.375, 47.375, 47.375, 47.375, 47.375),
                                         precip=c(3.8, 3.8, 3.8, 3.7, 3.7),
                                         err1=c(20.00, 19.99, 19.97, 19.84, 20.00),
                                         err2=c(1, 1, 1, 1, 1)),
               dat_19800105 = data.frame(long=c( 6.125 ,6.375, 6.625, 6.875, 7.125),
                                         lat=c(47.375, 47.375, 47.375, 47.375, 47.375),
                                         precip=c(2.2, 2.2, 2.2, 2.2, 4.8),
                                         err1=c(20.00, 19.99, 19.97, 19.84, 20.00),
                                         err2=c(1, 1, 1, 1, 1)))

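So I want to rebuild matrix-like tabular data from the raw plain-text data and compute annual statistics for each grid point separately. Perhaps dplyr or data.table offers utilities for this kind of operation. Is there a quick solution for this data transformation? How can it be done easily with dplyr utilities? Any ideas?

Desired output:

In my expected output, I want to drop the fourth (err1) and fifth (err2) columns, keep the long and lat columns at the same dimensions, and place each day's precip values in a new column. Here is a reproducible example of my expected output: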

desired_output = data.frame(
    long=c( 6.125 ,6.375, 6.625, 6.875, 7.125),
    lat=c(47.375, 47.375, 47.375, 47.375, 47.375),
    precip_day1=c(0, 0, 0, 0, 0),
    precip_day2=c(1.5, 1.5, 1.5, 1.5, 2.9),
    precip_day3=c(3.3, 3.3, 3.3, 3.3, 1.3),
    precip_day4=c(3.8, 3.8, 3.8, 3.7, 3.7),
    precip_day5=c(2.2, 2.2, 2.2, 2.2, 4.8)
)

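Basically, I want to simplify the raw data and restructure it into matrix-like tabular data so that the annual mean of precip is easier to compute for each grid coordinate. For simplicity and efficiency, the final output I expect has just the long, lat, and annual_mn_precip columns.

How can I do this data simplification and transformation in R? Is there an easier way? Thanks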

Answer 1

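You can use readLines to read the raw text file into a vector of lines. You can then determine which lines contain dates and which contain observations (based on indentation in this case), read them into separate data frames, and combine the data frames based on the indices of the lines that contain dates. Here is code to do this: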

parse_weather <- function(file) {
  lines <- readLines(file)

  # Indicators for whether a line contains a date or an observation
  date_lines <- !startsWith(lines, " ")
  data_lines <- !date_lines

  # Number of observations for each date
  nobs <- diff(c(which(date_lines), length(lines) + 1)) - 1

  dates <- read.table(
    # repeat date for each observation
    text = paste(rep(lines[date_lines], nobs), collapse = "\n"),
    col.names = c("year", "month", "day", "days")
  )

  observations <- read.table(
    text = paste(lines[data_lines], collapse = "\n"),
    col.names = c("long", "lat", "precip", "err1", "err2")
  )

  cbind(dates, observations)
}

# I saved the example data snippet as a local text file
weather <- parse_weather("weather.txt")
head(weather, 8)
#>   year month day days  long    lat precip  err1 err2
#> 1 1980     1   1    1 6.125 47.375    0.0 20.00    1
#> 2 1980     1   1    1 6.375 47.375    0.0 19.99    1
#> 3 1980     1   1    1 6.625 47.375    0.0 19.97    1
#> 4 1980     1   1    1 6.875 47.375    0.0 19.84    1
#> 5 1980     1   1    1 7.125 47.375    0.0 20.00    1
#> 6 1980     1   2    2 6.125 47.375    1.5 20.00    1
#> 7 1980     1   2    2 6.375 47.375    1.5 19.99    1
#> 8 1980     1   2    2 6.625 47.375    1.5 19.97    1
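
To make the nobs step concrete, here is a small sketch that runs the same indexing logic on an in-memory vector instead of a file. The five sample lines are a shortened, hypothetical excerpt of the snippet from the question, and header_idx is a name introduced only for this illustration.

```r
# Shortened excerpt: two dates, with 2 and 1 observation lines respectively
lines <- c(
  "1980   1   1   1",
  "      6.125 47.375     0.0    20.00     1.0",
  "      6.375 47.375     0.0    19.99     1.0",
  "1980   1   2   2",
  "      6.125 47.375     1.5    20.00     1.0"
)

# Date lines are the ones that do not start with a blank
date_lines <- !startsWith(lines, " ")
header_idx <- which(date_lines)

# Distance between consecutive date lines, minus the date line itself,
# gives the number of observation lines that belong to each date
nobs <- diff(c(header_idx, length(lines) + 1)) - 1
print(header_idx)
print(nobs)
```

Appending length(lines) + 1 acts as a sentinel so that the last date also gets a count.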

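It is probably easier to work with these data in the long format that this import strategy leaves you with. If you want a column for each day, you can get there by reshaping the data with, for example, tidyr::spread or reshape2::dcast.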
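
As a rough sketch of both routes, the following uses only base R (aggregate() and reshape()) so that it runs without extra packages; this is a swapped-in alternative to the tidyr and reshape2 functions named above, not the answer's own code. The toy weather data frame mirrors the first two days of the example snippet, and the annual and wide names are assumptions made for this illustration.

```r
# Toy long-format data: first two days of the example snippet
weather <- data.frame(
  year = 1980, month = 1,
  day = rep(1:2, each = 5),
  days = rep(1:2, each = 5),
  long = rep(c(6.125, 6.375, 6.625, 6.875, 7.125), times = 2),
  lat = 47.375,
  precip = c(0, 0, 0, 0, 0, 1.5, 1.5, 1.5, 1.5, 2.9)
)

# Mean precipitation per grid point, computed directly on the long format
annual <- aggregate(precip ~ long + lat, data = weather, FUN = mean)
names(annual)[names(annual) == "precip"] <- "annual_mn_precip"

# Wide layout: one precip column per day of the year
wide <- reshape(
  weather[, c("long", "lat", "days", "precip")],
  idvar = c("long", "lat"), timevar = "days",
  direction = "wide", sep = "_day"
)
print(annual)
print(names(wide))
```

The mean per grid point does not need the wide layout at all, which is one reason the long format is convenient here.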

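Edit: It turns out that read.table is very slow when a large character vector is given as its text argument. Pasting the lines vector into a single string makes processing of large files much faster; I have updated the answer accordingly.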
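
The two ways of passing text can be compared on a tiny input. This sketch only checks that they parse to the same data frame; it does not measure the speed difference, which shows up only on large inputs. The names obs_lines, cols, from_vector, and from_string are introduced here for illustration.

```r
obs_lines <- c(
  "      6.125 47.375     0.0    20.00     1.0",
  "      6.375 47.375     0.0    19.99     1.0"
)
cols <- c("long", "lat", "precip", "err1", "err2")

# One vector element per line
from_vector <- read.table(text = obs_lines, col.names = cols)

# All lines collapsed into a single newline-separated string
from_string <- read.table(
  text = paste(obs_lines, collapse = "\n"),
  col.names = cols
)

print(isTRUE(all.equal(from_vector, from_string)))
```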

Answer 2

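The salient features of this problem are:

One header record per day, with a variable number of observation (detail) records per day
The detail observation rows contain no key that links them to their header
Header records have 4 columns; detail records have 5 columns
Because a longitude coordinate may have up to 3 digits to the left of the decimal point, we cannot rely on a blank in the first column to distinguish header records from detail records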
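
The difference in column counts can be checked directly. The sketch below wraps the whitespace-normalizing regular expression used in this answer in a helper called countTokens, a hypothetical name introduced only for this illustration, and applies it to one header line and one detail line from the question.

```r
countTokens <- function(x) {
  # Strip leading and trailing blanks and collapse runs of blanks to one
  cleaned <- gsub("^ *|(?<= ) | *$", "", x, perl = TRUE)
  length(strsplit(cleaned, " ")[[1]])
}

headerLine <- "1980   1   1   1"
detailLine <- "      6.125 47.375     0.0    20.00     1.0"

print(countTokens(headerLine))
print(countTokens(detailLine))
```

Counting tokens does not depend on how wide the first field is, which is why it stays safe even for longitudes with three digits before the decimal point.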

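The most direct way to read this file and align the header information with the detail information is to use text processing to reshape the file so that each record holds one observation together with its header fields. Once the raw data has been reshaped, it is easy to read with read.table().

The required transformation is done in base R with a combination of readLines() and lapply().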

inFile <- "./data/tempdata1980.txt"
outputFile <- "./data/tempData.txt"
# delete output file if it already exists
if (file.exists(outputFile)) file.remove(outputFile)
theText <- readLines(inFile)
header <- NULL # scope to retain header across executions of lapply()
theResult <- lapply(theText,function(x){
     # reduce blanks to 1 between tokens 
     aRow <- unlist(strsplit(gsub("^ *|(?<= ) | *$", "", x, perl = TRUE)," "))
     # use <<- form of assignment operator to set to parent of if() environment 
     if (length(aRow) == 4) header <<- x
     else {
          cat(paste(header,x),file=outputFile,
              sep="\n",append=TRUE)
     }
})
# now read with read.table
colNames <- c("year","month","day","dayOfYear","long","lat","precip","err1","err2")
theData <- read.table(outputFile,header=FALSE,col.names = colNames)

...and the output:

> head(theData)
  year month day dayOfYear  long    lat precip  err1 err2
1 1980     1   1         1 6.125 47.375    0.0 20.00    1
2 1980     1   1         1 6.375 47.375    0.0 19.99    1
3 1980     1   1         1 6.625 47.375    0.0 19.97    1
4 1980     1   1         1 6.875 47.375    0.0 19.84    1
5 1980     1   1         1 7.125 47.375    0.0 20.00    1
6 1980     1   2         2 6.125 47.375    1.5 20.00    1
>

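This approach also avoids having to track the number of observations per day in order to merge each header record with the correct number of detail records.

Update: improving the performance of the solution

According to comments on this answer, the script takes a long time to run on the full data referenced in the OP. The raw data file has 407,705 rows: 365 header records and 407,340 detail records. The solution above takes about 155 seconds to transform the data and load it into a data frame on a MacBook Pro with the following configuration.

Operating system: OS X Yosemite 10.10.4 (14E46)
Processor: Intel i5, 2.6 GHz, turbo up to 3.3 GHz, two cores
Memory: 8 GB
Disk: 512 GB solid-state drive
Build date: April 2013

Reasons for the slow performance

Compared with the other answers to this post, there are two potential sources of slowness:

Use of the string functions gsub() and strsplit(), one of which produces a list of strings as its output
Use of cat(..., append=TRUE) inside the loop, which means R must open the file, seek to the end, and add content more than 400,000 times

Performance optimizations

We adjusted the code to improve its performance in the following ways.

Use the readr library for reading and writing, because it runs much faster than the base R functions
Write the output of lapply() to a vector in memory rather than to disk while lapply() executes, and
Write the output file in a single call to readr::write_lines()

The updated version runs in about 23 seconds, a large improvement over the original. The modified code and execution times are listed below.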

inFile <- "./data/25_krig_all_1980.txt"
outputFile <- "./data/tempData.txt"
if (file.exists(outputFile)) file.remove(outputFile)
library(readr)
system.time(theText <- readLines(inFile))
#   user  system elapsed 
#  1.821   0.027   1.859 

header <- NULL # scope to retain header across executions of lapply()
outVector <- NULL
i <- 1 
system.time(theResult <- lapply(theText,function(x){
     # reduce blanks to 1 between tokens 
     aRow <- unlist(strsplit(gsub("^ *|(?<= ) | *$", "", x, perl = TRUE)," "))
     # use <<- form of assignment operator to set to parent of if() environment 
     if (length(aRow) == 4) header <<- x
     else {
          outVector[i] <<- paste(header,x)
          i <<- i + 1
     }
}))
#   user  system elapsed 
# 19.327   0.085  19.443 

# write to file
system.time(write_lines(outVector,outputFile))
#   user  system elapsed 
#  0.079   0.020   0.117 

# now read with readr::read_table2() (deprecated since readr 2.0.0 in favor of read_table())
colNames <- c("year","month","day","dayOfYear","long","lat","precip","err1","err2")
system.time(theData <- read_table2(outputFile,col_names = colNames))
#  user  system elapsed 
# 0.559   0.071   0.794

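One final optimization

Another answer to the OP checks whether the first column of a record is blank to decide whether the record is a header or a detail record. At the top of my post I noted that this technique is risky, because a longitude may have 3 digits before the decimal point.

However, it turns out that this data contains no longitudes more than 100 degrees east or west of the prime meridian, so we can use the following code to improve the runtime performance of the script.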

header <- NULL # scope to retain header across executions of lapply()
outVector <- NULL
i <- 1
system.time(theResult <- lapply(theText,function(x){
     # use <<- form of assignment operator to set to parent of if() environment 
     if (substr(x,1,1) != " ") header <<- x
     else {
          outVector[i] <<- paste(header,x)
          i <<- i + 1
     }
}))
#   user  system elapsed 
#  2.840   0.080   2.933

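As the timings show, replacing the gsub()/strsplit() tokenization with a simple substring comparison cuts the runtime of the lapply() step from nearly 20 seconds to about 3 seconds.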
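
Going one step further, the loop can in principle be dropped entirely. The following is a minimal sketch of a vectorized variant, not part of the timed solution above and not benchmarked here: it uses cumsum() on the header indicator to pair each detail line with the most recent header. It assumes the first line of the input is a header and that headers never start with a blank; isHeader, headerId, and headers are names introduced for this illustration, and the sample lines are a shortened excerpt of the question's snippet.

```r
theText <- c(
  "1980   1   1   1",
  "      6.125 47.375     0.0    20.00     1.0",
  "      6.375 47.375     0.0    19.99     1.0",
  "1980   1   2   2",
  "      6.125 47.375     1.5    20.00     1.0"
)

# TRUE for header lines (assumption: only detail lines start with a blank)
isHeader <- substr(theText, 1, 1) != " "

# cumsum() gives every line the index of the most recent header above it
headerId <- cumsum(isHeader)
headers <- theText[isHeader]

# Prefix each detail line with its header, then parse everything at once
outVector <- paste(headers[headerId[!isHeader]], theText[!isHeader])
colNames <- c("year", "month", "day", "dayOfYear",
              "long", "lat", "precip", "err1", "err2")
theData <- read.table(text = paste(outVector, collapse = "\n"),
                      header = FALSE, col.names = colNames)
print(theData)
```

Because paste() and the indexing operate on whole vectors, no <<- assignment or growing vector is needed in this variant.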
