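How to tell readr::read_csv to guess double column correctly
I have runoff data with many zero values and only occasional non-zero double values.
'readr::read_csv' guesses an integer column type because of the many zeros.
How can I make read_csv guess the correct double type?
I do not know the variable names in advance, so I cannot supply a name-to-type mapping.
Here is a small example: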
library(readr)
# create a column of doubles with many zeros (runoff data)
#dsTmp <- data.frame(x = c(rep(0.0, 2), 0.5)) # this works
dsTmp <- data.frame(x = c(rep(0.0, 1e5), 0.5))
write_csv(dsTmp, "tmp/dsTmp.csv")
# 0.0 is written as 0
# read_csv now guesses integer instead of double and reports
# a parsing failure.
ans <- read_csv("tmp/dsTmp.csv")
# the last value is NA instead of 0.5
tail(ans)
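Can I tell it to try a wider column type instead of issuing a parsing failure?
Issue 645 mentions this problem, but the workaround given there is on the writing side. I have little influence on how the files are written.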
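Here are two techniques. (Data preparation is at the bottom; $hp and $vs onward are integer columns.)
Note: I add cols(.default=col_guess()) to most of the first calls so that we don't get the large message about what read_csv found the columns to be. It can be omitted at the cost of a noisier console.
Force all columns to double with the cols(.default=...) setting. This is safe only as long as you know there are no non-numeric values in the file; here the text column cyl fails to parse and becomes NA: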
read_csv("mtcars.csv", col_types = cols(.default = col_double()))
# Warning in rbind(names(probs), probs_f) :
# number of columns of result is not a multiple of vector length (arg 1)
# Warning: 32 parsing failures.
### ...snip...
# See problems(...) for more details.
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 NA 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 NA 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 NA 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 NA 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 NA 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 NA 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 NA 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 NA 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 NA 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 NA 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows
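The cyl column above turned into NA because it holds text. If a few text columns happen to be known up front, they can be listed as exceptions while everything else defaults to double. A minimal sketch, assuming the mtcars.csv preparation from the data section (recreated here in a temporary directory so the block runs on its own) and assuming cyl is the only non-numeric column:

```r
library(readr)

# Recreate the example file (same preparation as in the data section).
mt <- mtcars
mt$cyl <- paste0("c", mt$cyl)
csv_file <- file.path(tempdir(), "mtcars.csv")
write_csv(mt, csv_file)

# Default every column to double, but declare the known text column.
res <- read_csv(
  csv_file,
  col_types = cols(.default = col_double(), cyl = col_character())
)

# Inspect the resulting column classes.
sapply(res, class)
```

This only helps when the text columns are known; with completely unknown column names the second technique is the better fit.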
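Changing only the <int> (col_integer()) columns takes a little more care. My use of n_max=50 needs to be balanced (note that the call shown below uses n_max=1). Similar to guess_max=, a little more is better. In this case, with n_max=1 the first mpg value suggests integer, which is fine because that column is then switched to double anyway. But if you have other fields that are ambiguous with other classes, you will need more rows. Since you say you do not want to read in the whole file but are willing to read in "a bit" to get the right guess, I think you can be robust for chr and lgl columns with a reasonable value here (hundreds? thousands?).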
types <- attr(read_csv("mtcars.csv", n_max=1, col_types = cols(.default = col_guess())), "spec")
(intcols <- sapply(types$cols, identical, col_integer()))
# mpg cyl disp hp drat wt qsec vs am gear carb
# TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
types$cols[intcols] <- replicate(sum(intcols), col_double())
And the final read. Note that $hp and onward are now <dbl> (unlike in the data-preparation read below).
read_csv("mtcars.csv", col_types = types)
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 c6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 c6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 c4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 c6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 c8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 c6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 c8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 c4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 c4 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 c6 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows
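The guess_max= argument mentioned above is another option when scanning more of the file for guessing is acceptable. The sketch below applies it to the runoff example from the question; the value 2e5 is an assumption, chosen only so that it exceeds the number of rows in this file.

```r
library(readr)

# Runoff-like data: many zeros followed by one non-integer value.
dsTmp <- data.frame(x = c(rep(0.0, 1e5), 0.5))
csv_file <- tempfile(fileext = ".csv")
write_csv(dsTmp, csv_file)

# Let the type guesser inspect more rows than the file contains,
# so the trailing 0.5 is seen before the column type is fixed.
ans <- read_csv(
  csv_file,
  guess_max = 2e5,
  col_types = cols(.default = col_guess())
)
tail(ans)
```

The trade-off is that guessing now looks at every row, which is what the question hoped to avoid for large files.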
Data:
library(readr)
mt <- mtcars
mt$cyl <- paste0("c", mt$cyl) # for fun
write_csv(mt, "mtcars.csv") # file passed positionally; the argument name differs between readr versions
read_csv("mtcars.csv", col_types = cols(.default = col_guess()))
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <chr> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
# 1 21 c6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 c6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 c4 108 93 3.85 2.32 18.6 1 1 4 1
# 4 21.4 c6 258 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 c8 360 175 3.15 3.44 17.0 0 0 3 2
# 6 18.1 c6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 c8 360 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 c4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 c4 141. 95 3.92 3.15 22.9 1 0 4 2
# 10 19.2 c6 168. 123 3.92 3.44 18.3 1 0 4 4
# # ... with 22 more rows
data.table::fread seems to handle this correctly.
library(data.table)
write_csv(dsTmp, ttfile <- tempfile())
ans <- fread(ttfile)
tail(ans)
# x
# 1: 0.0
# 2: 0.0
# 3: 0.0
# 4: 0.0
# 5: 0.0
# 6: 0.5
From the ?fread help page:
Rarely, the file may contain data of a higher type in rows outside the
sample (referred to as an out-of-sample type exception). In this event
fread will automatically reread just those columns from the beginning
so that you don't have the inconvenience of having to set colClasses
yourself;
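To check that behaviour on your own data, and to mirror the read_csv workaround by promoting any column that fread still types as integer, something like the following might do. It is only a sketch: the file is written with data.table::fwrite instead of write_csv to keep the block self-contained, and int_cols is an illustrative name.

```r
library(data.table)

# Same shape as the runoff example: zeros, then one fractional value.
dsTmp <- data.frame(x = c(rep(0.0, 1e5), 0.5))
csv_file <- tempfile(fileext = ".csv")
fwrite(dsTmp, csv_file)

ans <- fread(csv_file)

# Promote columns that are genuinely integer throughout the file,
# so downstream code always sees double.
int_cols <- names(ans)[vapply(ans, is.integer, logical(1))]
if (length(int_cols) > 0) {
  ans[, (int_cols) := lapply(.SD, as.numeric), .SDcols = int_cols]
}
sapply(ans, class)
```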
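I turned the code from r2evans's solution into a small function (it needs readr, plus purrr for map_lgl and magrittr for %>%):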
read_csvDouble <- function(
### read_csv but read guessed integer columns as double
... ##<< further arguments to \code{\link{read_csv}}
, n_max = Inf ##<< see \code{\link{read_csv}}
, col_types = cols(.default = col_guess()) ##<< see \code{\link{read_csv}}
## the default suppresses the type guessing messages
){
  ##details<< Sometimes, double columns are guessed as integer, e.g. with
  ## runoff data where there are many zeros and only occasionally
  ## positive values that can be recognized as double.
  ## This function modifies \code{read_csv} by changing guessed integer
  ## columns to double columns.
#
colTypes <- read_csv(..., n_max = 3, col_types = col_types) %>% attr("spec")
isIntCol <- map_lgl(colTypes$cols, identical, col_integer())
colTypes$cols[isIntCol] <- replicate(sum(isIntCol), col_double())
##value<< tibble as returned by \code{\link{read_csv}}
ans <- read_csv(..., n_max = n_max, col_types = colTypes)
ans
}
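The function above hard-codes n_max = 3 for the guessing pass, which may be too few rows when chr or lgl columns are ambiguous, as discussed in the answer it is based on. A variant that exposes the number of guessing rows and avoids the purrr and magrittr dependencies could look like the sketch below. The name read_csv_int_as_double and the guess_n argument are hypothetical, and the check relies on readr's integer collector carrying the class "collector_integer".

```r
library(readr)

# Hypothetical helper: the name and the guess_n argument are illustrative.
read_csv_int_as_double <- function(file, ..., guess_n = 1000) {
  # First pass: read only the first guess_n rows to obtain a column spec.
  head_spec <- spec(read_csv(
    file, ..., n_max = guess_n,
    col_types = cols(.default = col_guess())
  ))
  # Swap every guessed integer collector for a double collector.
  is_int <- vapply(head_spec$cols, inherits, logical(1),
                   what = "collector_integer")
  head_spec$cols[is_int] <- rep(list(col_double()), sum(is_int))
  # Second pass: read the full file with the adjusted spec.
  read_csv(file, ..., col_types = head_spec)
}

# Usage on the runoff example from the question.
csv_file <- tempfile(fileext = ".csv")
write_csv(data.frame(x = c(rep(0.0, 1e5), 0.5)), csv_file)
ans <- read_csv_int_as_double(csv_file)
tail(ans)
```

Do not pass n_max or col_types through ..., since the helper sets them itself. If the installed readr version does not guess integer columns at all, is_int is all FALSE and the helper behaves like a plain read_csv call.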