使用文本文件中的日期和数据创建数据框

Question

我想从文件创建数据框。首先，我从互联网上将 ftp 文件下载到我的电脑中。可以在 link: data from ftp 中找到文件。我想在一列上创建一个数据框（取自每个文件的 Launch date 行），在其他列上也创建数据。
为了抓取所有文件，我使用了以下代码：

`setwd("C:/Users/")
path = "~C:/Users/"
files <- lapply(list.files(pattern = '\.l100'), readLines) 
test.sample<-do.call(rbind, lapply(files, function(lines){
  # for each file, return a data.frame of the datetime, pulled with regex
  data.frame(datetime = lubridate::dmy(sub('^.*Launch Date*    : ', '', lines[6])),
             # and the data, read in as text
             read.table(text = lines[13:length(lines)]))
}))`

我想知道上面的代码有什么问题。如果您可以编写新代码也很棒。提前谢谢你。

数据如下所示：

National Oceanic and Atmospheric Administration, U.S. Department of Commerce
Station        : Pago Pago, American Samoa
Station Height : 5 meters
Latitude       : -14.33
Longitude      : -170.71
Flight Number  : ASA016
Launch Date    : 18 July 1986
Launch Time    : 02:31:00 GMT
Radiosonde Type: 
Radiosonde Num : 
O3 Sonde ID    : 4A1834   Background: 0.050 microamps (0.20 mPa)  Flowrate: 29.90 sec/100ml  RH Corr: 2.00 %
Sonde Total O3 : 268   (51) DU      Sonde Total O3 (SBUV): 251 (35) DU

Level   Press    Alt   Pottp   Temp   FtempV   Hum  Ozone  Ozone   Ozone  Ptemp  O3 # DN O3 Res
 Num     hPa      km     K      C       C       %    mPa    ppmv   atmcm    C   10^11/cc   DU
   0  1007.7   0.005   304.0   31.5    26.0     73   1.63  0.016  0.0000   38.4    3.883    268
   1  9999.9   0.100  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
   2   982.0   0.200   301.2   26.5    21.4     74   1.69  0.017  0.0003   38.4    4.096    267
   3  9999.9   0.300  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
   4   960.0   0.400   301.2   24.6    20.6     79   1.64  0.017  0.0006   38.4    3.983    267
   5  9999.9   0.500  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
   6   946.0   0.600   301.1   23.2    20.2     83   1.65  0.017  0.0008   38.4    4.021    267
   7   932.0   0.700   300.9   21.8    19.9     89   1.65  0.018  0.0010   38.5    4.040    267
   8   920.9   0.800   301.1   21.0    16.4     75   1.65  0.018  0.0012   38.5    4.051    266
   9  9999.9   0.900  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
  10  9999.9   1.000  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999
  11   893.0   1.100   302.0   19.2    12.1     64   1.58  0.018  0.0016   38.5    3.923    266
  12  9999.9   1.200  9999.9  999.9   999.9    999  99.90 99.999 99.9990  999.9  999.999   9999

我有很多这样的文件。再次感谢。

Answer 1

虽然我喜欢这个 fixed-width table 文件中的元数据和 self-documentation，但以编程方式导入它们可能会很有趣。 Excel 碰巧做得很好，因为它不会抱怨与模式不匹配的行，但这无助于导入大量文件。

所以我们需要一些东西 home-baked。我只下载了其中两个文件，但这应该与更多文件同样有效。（我假设格式是一致的，否则我们需要修复一些魔法常量，例如 15（要忽略的行）和 29（包含 table headers 的行）。）

(files <- list.files(pattern = "asa*"))
# [1] "asa001_1986_04_01_20.l100" "asa002_1986_04_14_18.l100"

这是主力军。与 "just" 按照您的要求获取日期不同，获取所有元数据 headers 的努力大致相同。他们通过在某些行上放置倍数（以及一些带有 space 的标签）使其出现了一些问题，但我想我找到了一个 suitable 模式来找到它们。

alldfs <- lapply(files, function(f) {
  # read in all but the data; two MAGIC CONSTANTS
  txt <- readLines(f, n = 29)[-(1:15)]
  i <- which(! nzchar(trimws(txt)))
  hdrs <- sapply(strsplit(txt[1:(i-1)], ": "), trimws)
  hdrsdf <- as.data.frame(Reduce(c, lapply(hdrs, function(hdr) {
    z <- strsplit(hdr, "  ")
    if (length(z) == 1) {
      setNames(NA_character_, nm = z[[1]])
    } else {
      setNames(lapply(z[-1], head, n=1), nm = trimws(sapply(z[-length(z)], tail, n=1)))
    }
  })), stringsAsFactors = FALSE)
  df <- as.data.frame(lapply(read.table(f, skip=27, stringsAsFactors=F)[-1,],
                             as.numeric))
  cbind.data.frame(hdrsdf, df)
})
length(alldfs)
# [1] 2

结果很宽data.frame。前 13 列是不变的，元数据 headers。如果您只想要 Launch.Date，请随意放弃其余部分。

str(alldfs[[1]])
# 'data.frame': 361 obs. of  27 variables:
#  $ Station              : chr  "Pago Pago, American Samoa" "Pago Pago, American Samoa" "Pago Pago, American Samoa" "Pago Pago, American Samoa" ...
#  $ Station.Height       : chr  "5 meters" "5 meters" "5 meters" "5 meters" ...
#  $ Latitude             : chr  "-14.33" "-14.33" "-14.33" "-14.33" ...
#  $ Longitude            : chr  "-170.71" "-170.71" "-170.71" "-170.71" ...
#  $ Flight.Number        : chr  "ASA001" "ASA001" "ASA001" "ASA001" ...
#  $ Launch.Date          : chr  "01 April 1986" "01 April 1986" "01 April 1986" "01 April 1986" ...
#  $ Launch.Time          : chr  "20:40:00 GMT" "20:40:00 GMT" "20:40:00 GMT" "20:40:00 GMT" ...
#  $ Radiosonde.Type      : chr  NA NA NA NA ...
#  $ Radiosonde.Num       : chr  NA NA NA NA ...
#  $ O3.Sonde.ID          : chr  "4A1911" "4A1911" "4A1911" "4A1911" ...
#  $ Background           : chr  "0.050 microamps (0.19 mPa)" "0.050 microamps (0.19 mPa)" "0.050 microamps (0.19 mPa)" "0.050 microamps (0.19 mPa)" ...
#  $ Flowrate             : chr  "29.10 sec/100ml" "29.10 sec/100ml" "29.10 sec/100ml" "29.10 sec/100ml" ...
#  $ RH.Corr              : chr  "2.00 %" "2.00 %" "2.00 %" "2.00 %" ...
#  $ Sonde.Total.O3       : chr  "242" "242" "242" "242" ...
#  $ Sonde.Total.O3..SBUV.: chr  "233 (36) DU" "233 (36) DU" "233 (36) DU" "233 (36) DU" ...
#  $ Level                : num  1010 10000 10000 10000 971 ...
#  $ Press                : num  0.005 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ...
#  $ Alt                  : num  302 10000 10000 10000 302 ...
#  $ Pottp                : num  30.2 999.9 999.9 999.9 25.8 ...
#  $ Temp                 : num  24.6 999.9 999.9 999.9 20.7 ...
#  $ FtempV               : num  72 999 999 999 74 999 999 999 75 999 ...
#  $ Hum                  : num  1.65 99.9 99.9 99.9 1.68 99.9 99.9 99.9 1.48 99.9 ...
#  $ Ozone                : num  0.016 99.999 99.999 99.999 0.017 ...
#  $ Ozone.1              : num  0e+00 1e+02 1e+02 1e+02 5e-04 ...
#  $ Ozone.2              : num  38.1 999.9 999.9 999.9 38.1 ...
#  $ Ptemp                : num  3.93 1000 1000 1000 4.07 ...
#  $ O3                   : num  242 9999 9999 9999 241 ...

使用文本文件中的日期和数据创建数据框

Creating data frame with dates and data from a text files

ftp

r

dataframe

data-extraction

read.table