Extracting a table (and other information) from a text file in R
I'm trying to use R to extract a data table - and some other information - from historical Met Office data, but despite spending all evening on Stack Overflow I keep running into problems.
For example, here's the data for sunny (maybe??) Lowestoft:
Lowestoft / Lowestoft Monckton Ave from Sept 2007
Location 654300E 294600N 25m amsl to July 2007
& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by ---.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder.
   yyyy  mm   tmax    tmin      af    rain     sun
              degC    degC    days      mm   hours
   1914   1    5.2     0.7     ---    52.0     ---
   1914   2    9.2     3.5     ---    28.0     ---
   1914   3    ---     ---     ---     ---     ---
   1914   4   12.9     5.3     ---    18.0     ---
   ...
   2020  11   12.5*    6.1*      0*   31.9*   73.7*  Provisional
   2020  12    7.7*    2.9*      6*  105.8*   50.5*  Provisional
   2021   1    5.8*    1.2*     10*   78.6*   49.4*  Provisional
   2021   2    7.9*    2.4*      9*   48.6*   84.7*  Provisional
The best I've managed so far is to use sed (outside of R) to strip out the *'d and #'d values and then import with read.table(lowestoftdata.text, skip = 8, col.names = c("year","month","max_temp", "min_temp", "frost", "rainfall", "sunshine")), but that fails when it hits the post-2020 data marked Provisional. It would also be very handy to extract the latitude and longitude values, which usually sit on line 2 but can be on line 3 if, as with Lowestoft, the station moved at some point; my very limited regex knowledge (and a moving target) is letting me down.
My pseudocode approach is:
- identify the line with the latitude and longitude, and parse that line to extract those values
- identify the first line that starts with a number, and run read.table from that line onwards
...but putting that into practice is proving challenging (the rough sketch below is the sort of thing I have in mind), since I have little experience handling anything other than well-formatted CSV files, so any advice on where to start would be much appreciated.
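A very rough sketch of that plan (the file is assumed downloaded locally as lowestoftdata.txt; the col.names, the extra "note" column and fill = TRUE are just my guesses at absorbing the Provisional field):
all_lines <- readLines("lowestoftdata.txt")
# 1. find the line containing "Lat ... Lon ..." and pull out the two numbers
latlon_line <- grep("Lat ", all_lines, value = TRUE)[1]
lat <- as.numeric(sub(".*Lat ([0-9.]+).*", "\\1", latlon_line))
lon <- as.numeric(sub(".*Lon ([0-9.-]+).*", "\\1", latlon_line))
# 2. find the first line that starts with a 4-digit year and read from there
first_data_line <- grep("^\\s*[0-9]{4}\\s", all_lines)[1]
dat <- read.table(text = all_lines[first_data_line:length(all_lines)],
                  fill = TRUE,
                  col.names = c("year", "month", "max_temp", "min_temp",
                                "frost", "rainfall", "sunshine", "note"))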
Here's one way to do the requested "parsing" of the header text:
metadata <-
readLines(url("https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt"), n=9)
> metadata
[1] "Lowestoft / Lowestoft Monckton Ave from Sept 2007"
[2] "Location 654300E 294600N 25m amsl to July 2007 "
[3] "& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl"
[4] "Estimated data is marked with a * after the value."
[5] "Missing data (more than 2 days missing in month) is marked by ---."
[6] "Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder."
[7] " yyyy mm tmax tmin af rain sun"
[8] " degC degC days mm hours"
> sub( "Location (\d+[EW]) (\d+[NS])(.+$)", "\1,\2", metadata[2])
[1] "654300E,294600N"
I needed to apply a "ruler" to the data to get the positions and widths for a read.fwf approach.
> paste( rep("123456789",6), 1:6, collapse="", sep="")
[1] "123456789112345678921234567893123456789412345678951234567896"
> metadata[9]
[1] " 1914 1 5.2 0.7 --- 52.0 ---"
This gives a character result. You'd need to do some further processing to strip out the asterisks before using as.numeric; I illustrate that with a single column below. You could probably also use metadata[9] to work out the column names.
widths=c(3,4,4,7,8,7,10,7)
dat=read.fwf( "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt", widths = widths , skip=8, colClasses="character", header=FALSE)
Warning message:
In readLines(file, n = thisblock) :
incomplete final line found on 'https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt'
tail(dat)
#---------------------
         V1   V2  V3      V4       V5    V6       V7      V8
1269        2020   9   19.6 *   11.5 *    0*    97.1*   168.6
1270        2020  10   14.2 *    9.0 *    0*    85.7*    58.8
1271        2020  11   12.5 *    6.1 *    0*    31.9*    73.7
1272        2020  12    7.7 *    2.9 *    6*   105.8*    50.5
1273        2021   1    5.8 *    1.2 *   10*    78.6*    49.4
1274        2021   2    7.9 *    2.4 *    9*    48.6*    84.7
#----------------
head(dat)
       V1   V2 V3    V4    V5   V6    V7   V8
1          1914  1   5.2   0.7  ---  52.0  ---
2          1914  2   9.2   3.5  ---  28.0  ---
3          1914  3   ---   ---  ---   ---  ---
4          1914  4  12.9   5.3  ---  18.0  ---
5          1914  5  13.7   7.2  ---  38.0  ---
6          1914  6  16.2  10.4  ---  38.0  ---
summary(as.numeric(sub("[*]","", dat$V8)))
#--------------------
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   11.0    70.3   136.3   136.1   189.9   314.4     157
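If you want real names rather than V1...V8, one option (a sketch; it assumes, as in the output above, that the first fixed-width field is just the blank left margin) is to split the first header line:
hdr <- strsplit( trimws(metadata[7]), "\\s+" )[[1]]   # "yyyy" "mm" "tmax" "tmin" "af" "rain" "sun"
names(dat) <- c( "margin", hdr )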
There's also ?readr::read_fwf, which has some advantages. For one thing, it lets you specify the fwf layout with positions rather than widths; I find that easier, especially if you use my makeshift "ruler".
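For example, something along these lines (only a sketch; the start/end positions are the ones implied by the widths above, read off the ruler, and may need tweaking):
library(readr)
dat2 <- read_fwf(
  "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt",
  col_positions = fwf_positions(
    start = c(4, 8, 12, 19, 27, 34, 44),
    end   = c(7, 11, 18, 26, 33, 43, 50),
    col_names = c("yyyy", "mm", "tmax", "tmin", "af", "rain", "sun")
  ),
  col_types = cols(.default = col_character()),
  skip = 8
)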
Here's another route:
Cleaning this up takes quite a few different pieces.
First, deal with the two-line header (always a pain). There may be simpler solutions, but at some point you just need to get the job done.
I merge the two lines into one and use those slightly longer strings as the headers.
The clean-up step before reading the data is a bit cryptic, but it strips anything that isn't a digit, a dash or an asterisk from the end of each line. (That trims off the text notes that would otherwise mess up the fields parsed by fread, which is blazingly fast.)
library(data.table)
library(purrr)
library(readr)     # for read_file()
library(stringr)   # for str_match() and str_trim()
raw.text <- read_file("https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt")
lat.long <- as.numeric( str_match( raw.text, "Lat (\\d+\\.\\d+) Lon (\\d+\\.\\d+)" )[,-1] )
m <- regexpr( "+yyyy.*hours", raw.text )
headertext <- substr( raw.text, m, m+attr(m,"match.length")-1 )
header.lines <- strsplit( headertext, "\r?\n" )[[1]]
header.lines <- sub( "^\\s+", "", header.lines )
header.fields2 <- strsplit( header.lines, "\\s+" )
header.fields2[[2]] <- c( "", "", header.fields2[[2]] )
header.fields <- pmap_chr( header.fields2, paste, collapse=" " ) %>% str_trim
## some cleanup:
text.to.read <- substring( raw.text, m+attr(m,"match.length") )
## This next line matches anything that is not a digit (\d) and not a dash (\-) and not a star (\*) until the end of the line, $. It's the enclosing (?m: ... ) that changes $ to match end of line, and not end of string as usual.
text.to.read2 <- gsub( "(?m:([^\\d\\-\\*]*)$)", "", text.to.read, perl=TRUE )
## by now a simple fread will do the rest for us
d <- fread( text=text.to.read2, fill=TRUE, header=FALSE, na="---" )
setnames(d, header.fields)
d
Output:
      yyyy mm tmax degC tmin degC af days rain mm sun hours
   1: 1914  1       5.2       0.7    <NA>    52.0      <NA>
   2: 1914  2       9.2       3.5    <NA>    28.0      <NA>
   3: 1914  3      <NA>      <NA>    <NA>    <NA>      <NA>
   4: 1914  4      12.9       5.3    <NA>    18.0      <NA>
   5: 1914  5      13.7       7.2    <NA>    38.0      <NA>
  ---
1270: 2020 10     14.2*      9.0*      0*   85.7*     58.8*
1271: 2020 11     12.5*      6.1*      0*   31.9*     73.7*
1272: 2020 12      7.7*      2.9*      6*  105.8*     50.5*
1273: 2021  1      5.8*      1.2*     10*   78.6*     49.4*
1274: 2021  2      7.9*      2.4*      9*   48.6*     84.7*
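At this point the value columns are still character, with the estimation asterisks attached. If you want them numeric (and, say, a flag for the estimated rows), one possible follow-up step, not part of the answer above, is:
value.cols <- setdiff( names(d), c("yyyy", "mm") )
## flag rows where any value carries the 'estimated' asterisk
d[ , estimated := grepl("\\*", do.call(paste, .SD)), .SDcols = value.cols ]
## then strip the asterisks and convert to numeric
d[ , (value.cols) := lapply( .SD, function(x) as.numeric( sub("\\*", "", x) ) ), .SDcols = value.cols ]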