Specify number of columns to read when first row is missing values
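I have data from a logger that inserts timestamps as rows within the comma-separated data. I have already worked out a way to wrangle those timestamps into a tidy data frame (thanks to the replies to an earlier question).
The problem I'm having now is that the timestamp rows don't have the same number of comma-separated values as the data rows (3 vs. 6), and readr defaults to reading only 3 columns, even though I manually specify column types and names for 6. Last summer (when I last used the logger) readr read the data correctly, but frustratingly the current version (2.1.1) throws a warning and lumps columns 3:6 together. I'm hoping there is some option to revert to the old behavior, or some solution I haven't thought of (editing the logger files is not an option).
Example code: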
library(tidyverse)
# example data
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# example without timestamp header
txt2 <- "
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# throws warning and reads 3 columns
read_csv(
txt1,
col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
col_types = "ddcddc"
)
# works correctly
read_csv(
txt2,
col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
col_types = "ddcddc"
)
# this is the table that older readr versions would create
# and that I'm hoping to get back to
tribble(
~lon, ~lat, ~n, ~red, ~nir, ~NDVI,
NA, NA, "Logger Start 12:34", NA, NA, NA,
-112, 53, "N=1", 9, 15, ".25",
-112, 53, "N=2",12, 17, ".17"
)
Use base read.csv, then convert the column types afterward as needed:
read.csv(text=txt1, header = FALSE,
col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))
   lon lat                  n red nir NDVI
1   NA  NA Logger Start 12:34  NA  NA   NA
2 -112  53                N=1   9  15 0.25
3 -112  53                N=2  12  17 0.17
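The "convert types afterward" step is not spelled out above, so here is a minimal sketch of one way to do it in base R. It assumes you want the column types from the question (`ddcddc`), so everything is read as text first and only `lon`, `lat`, `red`, and `nir` are converted. The names `cols`, `raw`, `num_cols`, and `logger` are made up for illustration.

```r
# Same sample as in the question: a 3-field timestamp row, then 6-field data rows.
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

cols <- c("lon", "lat", "n", "red", "nir", "NDVI")

# Read every column as text so NDVI keeps its original form (".25", not 0.25).
raw <- read.csv(text = txt1, header = FALSE, col.names = cols,
                colClasses = "character")

# Convert only the columns that should be numeric; blank fields become NA.
num_cols <- c("lon", "lat", "red", "nir")
logger <- raw
logger[num_cols] <- lapply(raw[num_cols], as.numeric)

str(logger)
```

Reading as character first keeps NDVI exactly as the logger wrote it. Blank text fields on the timestamp row may come through as empty strings rather than NA, so check for both if you filter on them.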
I think I would use read_lines and write_lines to convert the "bad CSV" into a "good CSV", then read in the converted data.
Suppose you have a file test.csv like this:
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
Try something like this:
library(readr)  # read_lines(), write_lines(), read_csv()
library(dplyr)
library(tidyr)
read_lines("test.csv") %>%
  # assumes all timestamp lines are the same format
  gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
  # assumes that NDVI (last column) is always present and ends with a digit
  # you'll need to alter the regex if not the case
  gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE) %>%
  write_lines("test_out.csv")
test_out.csv now looks like this:
12:34,,,,,,
,-112,53,N=1,9,15,.25
,-112,53,N=2,12,17,.17
So we now have 7 columns, the first of which is the timestamp.
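To check the two substitutions without touching any files, you can run them on an in-memory character vector. This is only a sketch: it uses the sample lines from above, and `step1`, `step2`, and `n_fields` are illustrative names. Note that the backslashes have to be doubled inside R string literals.

```r
lines <- c(
  ",,Logger Start 12:34",
  "-112,53,N=1,9,15,.25",
  "-112,53,N=2,12,17,.17"
)

# Step 1: move the timestamp into the first field and pad the row to 7 fields.
step1 <- gsub(",,Logger Start (.*?)$", "\\1,,,,,,", lines, perl = TRUE)

# Step 2: prefix an empty timestamp field to every line that ends in a digit
# (the data rows); the rewritten timestamp row ends in a comma and is skipped.
step2 <- gsub("^(.*?\\d)$", ",\\1", step1, perl = TRUE)

# Count fields per line as (number of commas + 1).
n_fields <- nchar(gsub("[^,]", "", step2)) + 1

print(step2)
print(n_fields)
```

If a data row could ever end in something other than a digit, the second pattern will not match it and that row will stay at 6 fields, so a field count like this is a cheap sanity check before reading the file back in.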
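This code reads the new file, fills in the missing timestamp values, and removes the rows where n is NA. You may not want to do that; I'm assuming n is missing only because of the original timestamp rows.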
mydata <- read_csv("test_out.csv",
col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>%
fill(ts) %>%
filter(!is.na(n))
The final mydata:
# A tibble: 2 x 7
  ts       lon   lat n       red   nir  NDVI
  <time> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 12:34   -112    53 N=1       9    15  0.25
2 12:34   -112    53 N=2      12    17  0.17
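If you would rather not write an intermediate file, the same idea can be sketched in memory with base R. This is an alternative, not the code above: it assumes every timestamp row starts with `,,Logger Start ` and that each data row belongs to the most recent timestamp row before it. The names `ts_prefix`, `is_ts`, `stamps`, and `stamp_idx` are made up for illustration. For a real file you would start from `lines <- readLines("test.csv")`.

```r
lines <- c(
  ",,Logger Start 12:34",
  "-112,53,N=1,9,15,.25",
  "-112,53,N=2,12,17,.17"
)

ts_prefix <- "^,,Logger Start "           # assumed fixed timestamp-row format
is_ts     <- grepl(ts_prefix, lines)      # TRUE for timestamp rows
stamps    <- sub(ts_prefix, "", lines[is_ts])

# A running count of timestamp rows gives, for each line, the index of the
# most recent timestamp; it is 0 for any line before the first timestamp.
stamp_idx <- cumsum(is_ts)

# Parse only the data rows, which all have 6 fields.
mydata <- read.csv(
  text = paste(lines[!is_ts], collapse = "\n"),
  header = FALSE,
  col.names = c("lon", "lat", "n", "red", "nir", "NDVI")
)

# Attach the timestamps; the leading NA covers data rows with no timestamp yet.
mydata$ts <- c(NA_character_, stamps)[stamp_idx[!is_ts] + 1]

print(mydata)
```

Here `ts` stays a plain character column and NDVI is parsed as numeric. Convert either one afterward if you need a time class or the original text form.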