Extracting a table (and other information) from a text file in R

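I'm trying to use R to extract the data table from the historical Met Office data, along with some other information, but despite spending all evening on Stack Overflow I keep running into problems.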

For example, here is the data for sunny (maybe??) Lowestoft:

Lowestoft / Lowestoft Monckton Ave from Sept 2007
Location 654300E 294600N 25m amsl to July 2007 
& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by  ---.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder.
   yyyy  mm   tmax    tmin      af    rain     sun
              degC    degC    days      mm   hours
   1914   1    5.2     0.7    ---     52.0    ---
   1914   2    9.2     3.5    ---     28.0    ---
   1914   3   ---     ---     ---     ---     ---
   1914   4   12.9     5.3    ---     18.0    ---
   ...
   2020  11   12.5*    6.1*      0*   31.9*   73.7*  Provisional
   2020  12    7.7*    2.9*      6*  105.8*   50.5*  Provisional
   2021   1    5.8*    1.2*     10*   78.6*   49.4*  Provisional
   2021   2    7.9*    2.4*      9*   48.6*   84.7*  Provisional

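The best I have managed so far is to use sed (outside R) to strip the * and # markers, but importing the result with read.table("lowestoftdata.txt", skip = 8, col.names = c("year","month","max_temp", "min_temp", "frost", "rainfall", "sunshine")) fails when it reaches the rows from 2020 onwards that are marked Provisional. It would also be very handy to extract the latitude and longitude values. These are usually on line 2, but if the station moved at some point, as Lowestoft did, they can be on line 3, and my very limited regex knowledge (and a moving target) has let me down.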

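My pseudocode approach would be:

  1. Identify the line containing the latitude and longitude, and parse that line to extract those values
  2. Identify the first line that starts with a number, and run read.table from that line onwards

...but putting this into practice has proved challenging, as I have limited experience with anything other than well-formed CSV files, so any advice on where to start would be much appreciated.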

Here is one way to handle the request to "parse" the header text:

metadata <- 
 readLines(url("https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt"), n=9)
> metadata
[1] "Lowestoft / Lowestoft Monckton Ave from Sept 2007"                                                                                        
[2] "Location 654300E 294600N 25m amsl to July 2007 "                                                                                          
[3] "& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl"                                                                         
[4] "Estimated data is marked with a * after the value."                                                                                       
[5] "Missing data (more than 2 days missing in month) is marked by  ---."                                                                      
[6] "Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder."
[7] "   yyyy  mm   tmax    tmin      af    rain     sun"                                                                                       
[8] "              degC    degC    days      mm   hours"  

                                                                               

> sub( "Location (\\d+[EW]) (\\d+[NS])(.+$)", "\\1,\\2", metadata[2])
[1] "654300E,294600N"
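The question also asks for the latitude and longitude, which can sit on line 2 or line 3. A minimal sketch of one way to get them, assuming the header always writes them as "Lat <number> Lon <number>" as in the Lowestoft file; the names header, latlon_line and lat_lon are made up for illustration, and allowing a minus sign is a guess about how other stations might be written:

```r
# Header lines copied from the Lowestoft file shown in the question.
header <- c(
  "Lowestoft / Lowestoft Monckton Ave from Sept 2007",
  "Location 654300E 294600N 25m amsl to July 2007 ",
  "& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl"
)

# Pick whichever header line mentions both Lat and Lon.
latlon_line <- grep("Lat .*Lon ", header, value = TRUE)[1]

# Capture the two numbers. The optional minus sign is an assumption
# about how western longitudes might appear for other stations.
pattern <- "Lat\\s+(-?[0-9.]+)\\s+Lon\\s+(-?[0-9.]+)"
parts <- regmatches(latlon_line, regexec(pattern, latlon_line))[[1]]

# parts[1] is the whole match; parts[2:3] are the captured groups.
lat_lon <- as.numeric(parts[2:3])
names(lat_lon) <- c("lat", "lon")
lat_lon
```

Searching by content rather than by line number is what copes with the "moving target": it does not matter whether the station moved and pushed the coordinates onto a later line.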

I needed to lay a "ruler" over the data to get the positions and widths for the read.fwf approach.

> paste( rep("123456789",6), 1:6, collapse="", sep="")
[1] "123456789112345678921234567893123456789412345678951234567896"
> metadata[9]
[1] "   1914   1    5.2     0.7    ---     52.0    ---"

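The result below is all character. You will need some further processing to strip the asterisks before using as.numeric; I illustrate that with one column. You could probably use metadata[7] (the header line) to edit the column names.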
 widths=c(3,4,4,7,8,7,10,7)

 dat=read.fwf( "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt", widths = widths , skip=8, colClasses="character", header=FALSE)
Warning message:
In readLines(file, n = thisblock) :
  incomplete final line found on 'https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt'
 tail(dat)
#---------------------
      V1   V2   V3      V4       V5      V6         V7      V8
1269     2020    9    19.6 *   11.5 *       0*   97.1*   168.6
1270     2020   10    14.2 *    9.0 *       0*   85.7*    58.8
1271     2020   11    12.5 *    6.1 *       0*   31.9*    73.7
1272     2020   12     7.7 *    2.9 *       6*  105.8*    50.5
1273     2021    1     5.8 *    1.2 *     1 0*   78.6*    49.4
1274     2021    2     7.9 *    2.4 *       9*   48.6*    84.7
#----------------
head(dat)
   V1   V2   V3      V4       V5      V6         V7     V8
1     1914    1     5.2      0.7     ---      52.0     ---
2     1914    2     9.2      3.5     ---      28.0     ---
3     1914    3    ---      ---      ---      ---      ---
4     1914    4    12.9      5.3     ---      18.0     ---
5     1914    5    13.7      7.2     ---      38.0     ---
6     1914    6    16.2     10.4     ---      38.0     ---

summary(as.numeric(sub("[*]","", dat$V8)))
#--------------------
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   11.0    70.3   136.3   136.1   189.9   314.4     157 
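The single-column conversion above could be extended to every column along these lines. This is only a sketch: dat_small is a made-up stand-in for dat with illustrative values in the style of the file, and to_number is a hypothetical helper, not part of any package.

```r
# A tiny stand-in for `dat`: padded character columns, as read.fwf returns
# with colClasses = "character". Values are illustrative only.
dat_small <- data.frame(
  tmax = c("    5.2", "   --- ", "   12.5*"),
  sun  = c("   --- ", "  168.6#", "   73.7*"),
  stringsAsFactors = FALSE
)

# Drop the * and # markers, trim the padding, and turn "---" into NA
# before converting, so that as.numeric() has nothing to warn about.
to_number <- function(x) {
  x <- trimws(gsub("[*#]", "", x))
  x[x == "---"] <- NA
  as.numeric(x)
}

# Apply the helper to every column and rebuild the data frame.
dat_num <- as.data.frame(lapply(dat_small, to_number))
dat_num
```

Note that this throws away the information about which values were estimated; if that matters, record it (for example with grepl("[*]", x)) before stripping the markers.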

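There is also ?readr::read_fwf, which has some advantages. For one, it lets you specify the fixed-width layout with positions rather than widths. I find that easier, especially if you use my ad hoc "ruler".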
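To show what specifying positions looks like, here is a base R stand-in that cuts each row with substring() rather than calling readr. The start and end positions were counted off the two sample rows with the ruler, so they are an assumption about the full file; cut_row is a hypothetical helper. The same start/end vectors are the kind of thing readr::fwf_positions() expects.

```r
# Two rows copied from the sample data in the question.
rows <- c(
  "   1914   1    5.2     0.7    ---     52.0    ---",
  "   2021   1    5.8*    1.2*     10*   78.6*   49.4*  Provisional"
)

# Start and end character positions of each field; each value field
# includes the one-character marker (* or #) that may follow the number.
starts <- c(4, 8, 12, 20, 28, 36, 44)
ends   <- c(7, 11, 19, 27, 35, 43, 51)
col_names <- c("yyyy", "mm", "tmax", "tmin", "af", "rain", "sun")

# substring() is vectorised over first/last, so one call cuts a whole row.
# Rows shorter than the last end position are simply truncated.
cut_row <- function(line) trimws(substring(line, starts, ends))

fields <- t(vapply(rows, cut_row, character(length(starts)), USE.NAMES = FALSE))
colnames(fields) <- col_names
fields
```

Because the last field stops at position 51, the trailing "Provisional" comment never makes it into the table.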

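Here is another way:

Cleaning this up takes several different steps.

First, deal with the two-line header (this is always a pain). There may be simpler solutions, but at some point you just need to get the job done.

I merge the two lines into one and use those slightly longer texts as the headers.

The cleanup step before reading the data is a bit cryptic, but it strips anything from the end of each line that is not a digit, a dash or an asterisk. (This trims off the text comments, which would otherwise upset the field parsing done by fread, which is blazing fast.)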
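To see what that cleanup pattern does in isolation, here is a small sketch on two sample rows from the question; sample_text and cleaned are illustrative names.

```r
# Two rows from the file, joined into one string the way read_file() returns it.
sample_text <- paste(
  "   1914   3   ---     ---     ---     ---     ---",
  "   2020  11   12.5*    6.1*      0*   31.9*   73.7*  Provisional",
  sep = "\n"
)

# (?m: ... ) makes $ match at the end of every line, not just the end of
# the string. The character class matches a run of anything that is not
# a digit, a dash or a star, so only trailing comments are removed.
cleaned <- gsub("(?m:([^\\d\\-\\*]*)$)", "", sample_text, perl = TRUE)
cat(cleaned, sep = "\n")
```

The first row ends in a dash and is left alone; the second loses its trailing "  Provisional" but keeps the * markers.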


library(data.table)
library(purrr)    # pmap_chr(), %>%
library(readr)    # read_file()
library(stringr)  # str_match(), str_trim()

raw.text <- read_file("https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt")

## latitude and longitude, wherever they appear in the header
lat.long <- as.numeric( str_match( raw.text, "Lat (\\d+\\.\\d+) Lon (\\d+\\.\\d+)" )[,-1] )

## locate the two header lines, from "yyyy" through "hours"
## (with R's default regex engine, "." also matches the newline between them)
m <- regexpr( "yyyy.*hours", raw.text )

headertext <- substr( raw.text, m, m+attr(m,"match.length")-1 )
header.lines <- strsplit( headertext, "\r?\n" )[[1]]
header.lines <- sub( "^\\s+", "", header.lines )
header.fields2 <- strsplit( header.lines, "\\s+" )
## the units line has no entries for yyyy and mm, so pad it with two blanks
header.fields2[[2]] <- c( "", "", header.fields2[[2]] )

## paste name and unit together, e.g. "tmax degC"
header.fields <- pmap_chr( header.fields2, paste, collapse=" " ) %>% str_trim

## some cleanup: keep only the text after the header
text.to.read <- substring( raw.text, m+attr(m,"match.length") )

## This next line matches anything that is not a digit (\d), not a dash (\-)
## and not a star (\*) until the end of the line, $.
## It's the enclosing (?m: ... ) that changes $ to match end of line,
## and not end of string as usual.
text.to.read2 <- gsub( "(?m:([^\\d\\-\\*]*)$)", "", text.to.read, perl=TRUE )

## by now a simple fread will do the rest for us
d <- fread( text=text.to.read2, fill=TRUE, header=FALSE, na.strings="---" )
setnames(d, header.fields)

d

Output:


      yyyy mm tmax degC tmin degC af days rain mm sun hours
   1: 1914  1       5.2       0.7    <NA>    52.0      <NA>
   2: 1914  2       9.2       3.5    <NA>    28.0      <NA>
   3: 1914  3      <NA>      <NA>    <NA>    <NA>      <NA>
   4: 1914  4      12.9       5.3    <NA>    18.0      <NA>
   5: 1914  5      13.7       7.2    <NA>    38.0      <NA>
  ---                                                      
1270: 2020 10     14.2*      9.0*      0*   85.7*     58.8*
1271: 2020 11     12.5*      6.1*      0*   31.9*     73.7*
1272: 2020 12      7.7*      2.9*      6*  105.8*     50.5*
1273: 2021  1      5.8*      1.2*     10*   78.6*     49.4*
1274: 2021  2      7.9*      2.4*      9*   48.6*     84.7*