fread: empty string ("") in na.strings is not interpreted as NA

How to make fread() set "" to NA for all variables, including character variables

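I am importing a .csv file in which missing values are empty strings ("", with no space). I want "" to be interpreted as the missing value NA, so I tried `na.strings = ""`, but without success: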

data <- fread("file.csv", na.strings = "")

unique(data$character_variable)
# [1] "abc" "def"      ""            

On the other hand, when I use `read.csv` with `na.strings = ""`, "" becomes NA, even for character variables. This is the result I want.

data <- read.csv("file.csv", na.strings = "")

unique(data$character_variable)
# [1] "abc" "def"      NA

Version

Well, you can't if your csv file looks like this:

a,b
x,y
"",1

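Note that anything inside "" is treated as a string literal, because the double quote is the quoting character. In that sense, ,"", in a csv file simply means an empty string, not a missing value (which would be ,,). I think this is a nice consistency feature.

The na.strings section of the fread documentation says the same: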

A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA,, for example, when na.strings="NA".
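If the file itself cannot be changed, one possible workaround is to convert the empty strings after reading. The following is only a minimal sketch, not a built-in fread option: `blank_to_na` is a hypothetical helper name, and the inline text stands in for file.csv.

```r
library(data.table)

# Hypothetical helper (not part of data.table): replace "" with NA in every
# character column, modifying the table by reference.
blank_to_na <- function(dt) {
  char_cols <- names(dt)[vapply(dt, is.character, logical(1))]
  for (col in char_cols) {
    # which() drops NA comparisons, so values that are already NA are skipped
    blank_rows <- which(dt[[col]] == "")
    set(dt, i = blank_rows, j = col, value = NA_character_)
  }
  invisible(dt)
}

# Inline stand-in for file.csv, with a quoted empty field in column a
data <- fread(text = 'a,b\nx,y\n"",1\n', header = TRUE, na.strings = "")
blank_to_na(data)
unique(data$a)
```

Because the helper only touches character columns and runs after parsing, the result should not depend on whether fread itself returned "" or NA for the quoted field.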

On the other hand, you may notice that if the file looks like this

a,b
1,y
"",1

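then the empty string is converted to NA. However, I do not think this is a bug, because this behavior is probably a consequence of the parser's type coercion. In the Details section of the same documentation, you can see: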

The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character.

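So column a is first read as a character column and then converted to an integer column. The empty string is still read as is, but in the second step it is coerced to NA_integer_.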
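The coercion step can be illustrated with base R alone. This sketch only shows how R coerces an empty string to integer; it is an analogy for the explanation above, not a trace of what fread does internally, and `fields` is a hypothetical name for the raw values of column a.

```r
# Raw values of column a as character strings, before any type conversion
fields <- c("1", "")

# Coercing to integer turns the empty string into a missing value
coerced <- as.integer(fields)

is.na(coerced)
```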