Specify number of columns to read when first row is missing values

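I have data from a logger that inserts timestamps as rows within the comma-separated data. I have already worked out a way to wrangle these timestamps into a tidy data frame (thanks to the replies to an earlier question).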

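The problem I have now is that the timestamp rows do not have the same number of comma-separated values as the data rows (3 vs. 6), and readr defaults to reading only 3 columns, even though I manually specify column types and names for 6. Last summer (when I last used the logger) readr read the data correctly, but frustratingly the current version (2.1.1) throws a warning and lumps columns 3:6 together. I'm hoping there is some option to "revert" to the old behavior, or some workaround I haven't thought of (editing the logger files is not an option).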

Example code:

library(tidyverse)

# example data
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

# example without timestamp header
txt2 <- "
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

# throws warning and reads 3 columns
read_csv(
  txt1,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)

# works correctly
read_csv(
  txt2,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)

# this is the table that older readr versions would create
# and that I'm hoping to get back to
tribble(
  ~lon, ~lat, ~n, ~red, ~nir, ~NDVI,
    NA,   NA, "Logger Start 12:34", NA, NA, NA,
  -112,   53, "N=1", 9, 15, ".25",
  -112,   53, "N=2",12, 17, ".17"
)

Use base read.csv, then convert the column types afterwards if needed:

read.csv(text=txt1, header = FALSE,
     col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))
   lon lat                  n red nir NDVI
1   NA  NA Logger Start 12:34  NA  NA   NA
2 -112  53                N=1   9  15 0.25
3 -112  53                N=2  12  17 0.17
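
The "convert the types afterwards" step could look roughly like the sketch below. It assumes the same txt1 string as in the question; the target types mirror the "ddcddc" specification from the question, and the is_stamp helper column is a hypothetical name added only for illustration.

```r
# Same example data as in the question
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

df <- read.csv(text = txt1, header = FALSE,
               col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))

# Coerce columns explicitly instead of relying on read.csv's guesses
df$n   <- as.character(df$n)    # a factor in R < 4.0 unless stringsAsFactors = FALSE
df$red <- as.numeric(df$red)    # guessed as integer by read.csv
df$nir <- as.numeric(df$nir)

# Hypothetical helper column: timestamp rows are the ones without coordinates
df$is_stamp <- is.na(df$lon)
```

Note that read.csv parses NDVI as numeric here, so the original ".25" text becomes 0.25; if the raw strings matter, passing colClasses to read.csv is one option to consider.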

I think I would use read_lines and write_lines to convert the "bad CSV" into a "good CSV", then read in the converted data.

Suppose you have a file test.csv like this:

,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17

Try something like this:

library(readr)  # read_lines(), write_lines(), read_csv()
library(dplyr)
library(tidyr)

read_lines("test.csv") %>% 
  # assumes all timestamp lines are the same format
  gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
  # assumes that NDVI (last column) is always present and ends with a digit
  # you'll need to alter the regex if not the case 
  gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE) %>% 
  write_lines("test_out.csv")

test_out.csv now looks like this:

12:34,,,,,,
,-112,53,N=1,9,15,.25
,-112,53,N=2,12,17,.17

So we now have 7 columns, the first of which is the timestamp.
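
To see what each substitution does without touching any files, the two gsub calls can be run on a plain character vector. This is only a sketch using base R; the lines vector stands in for the output of read_lines("test.csv"), and step1/step2 are illustrative names.

```r
# Stand-in for read_lines("test.csv")
lines <- c(",,Logger Start 12:34",
           "-112,53,N=1,9,15,.25",
           "-112,53,N=2,12,17,.17")

# Step 1: keep only the time from timestamp lines and pad to 7 fields
step1 <- gsub(",,Logger Start (.*?)$", "\\1,,,,,,", lines, perl = TRUE)

# Step 2: prefix an empty timestamp field to every line ending in a digit
# (the rewritten timestamp line ends in a comma, so it is left alone)
step2 <- gsub("^(.*?\\d)$", ",\\1", step1, perl = TRUE)

# Every line should now contain 6 commas, i.e. 7 fields
n_commas <- nchar(gsub("[^,]", "", step2))
```

Checking n_commas before writing the file is a cheap way to catch lines that neither pattern matched.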

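This code reads the new file, fills in the missing timestamp values and removes the rows where n is NA. You may not want to do that; I'm assuming n is only missing because of the original timestamp rows.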

mydata <- read_csv("test_out.csv", 
                   col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>% 
  fill(ts) %>% 
  filter(!is.na(n))

The final mydata:

# A tibble: 2 x 7
  ts       lon   lat n       red   nir  NDVI
  <time> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 12:34   -112    53 N=1       9    15  0.25
2 12:34   -112    53 N=2      12    17  0.17
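
For readers who want to see what the fill-then-filter step does, here is a rough base R equivalent on a toy data frame. It is a sketch, not the tidyr implementation: the data frame below is a hand-built, trimmed stand-in for the parsed test_out.csv, and idx and result are illustrative names.

```r
# Toy stand-in for the parsed file: the timestamp sits on its own row
df <- data.frame(
  ts  = c("12:34", NA, NA),
  lon = c(NA, -112, -112),
  n   = c(NA, "N=1", "N=2"),
  stringsAsFactors = FALSE
)

# Index of the most recent non-missing timestamp at each row
idx <- cummax(ifelse(is.na(df$ts), 0L, seq_along(df$ts)))
idx[idx == 0L] <- NA        # rows before the first timestamp stay NA

df$ts <- df$ts[idx]         # carry the timestamp downwards, like fill(ts)

# Drop the rows that only carried a timestamp, like filter(!is.na(n))
result <- df[!is.na(df$n), ]
```

With several logger restarts in one file, each block of data rows picks up the timestamp of the nearest preceding timestamp row.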