R: why, and how to avoid: read.table turns character (strings) into numeric by removing the last character (a period)

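I have a data frame that I want to export to CSV and then re-import into a data frame. On import, one column gets corrupted: the trailing period is removed from each string and the values are interpreted as numbers.

Here is a minimal example: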

df <- data.frame(integers = c(1:8, NA, 10L),
                 doubles  = as.numeric(paste0(c(1:7, NA, 9, 10), ".1")),
                 strings = paste0(c(1:10),".")
                 )
df
str(df) # here the last column is "chr"

write.table(df,
            file = "df.csv",
            sep = "\t",
            na = "NA",
            row.names = FALSE,
            col.names = TRUE,
            fileEncoding = "UTF-8"
)

df <- read.table(file = "df.csv",
                 header = TRUE,
                 sep = "\t",
                 na.strings = "NA",
                 quote="\"",
                 fileEncoding = "UTF-8"
                 )
df
str(df)  # here the last column is "num"
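
As a quick check that the file on disk is intact and the change happens at read time, the sketch below writes a small column and inspects the raw lines. The names `df_demo`, `tmp`, `file_lines` and `reread` are made up for illustration, and a temporary file stands in for df.csv.

```r
# Hypothetical one-column data frame with the same kind of strings.
df_demo <- data.frame(strings = paste0(1:3, "."))

# Write to a temporary file so nothing in the working directory is touched.
tmp <- tempfile(fileext = ".csv")
write.table(df_demo, file = tmp, sep = "\t", row.names = FALSE)

# Raw lines of the file: the values are still quoted strings with the period.
file_lines <- readLines(tmp)
print(file_lines)

# Reading it back without colClasses: the quotes are stripped first and the
# remaining text is then converted automatically.
reread <- read.table(tmp, header = TRUE, sep = "\t", quote = "\"")
print(class(reread$strings))

unlink(tmp)
```

So the quoting done by write.table does not protect the column: read.table does not take quotes as a sign that a field should stay character.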

With read.table, we can specify colClasses. The valid atomic modes are listed in ?vector:

The atomic modes are "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw".

The issue is that, according to ?read.table, when colClasses is not specified the column types are determined automatically with type.convert:

Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer, numeric, complex or (depending on as.is) factor as appropriate.
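
The effect can be reproduced on a plain character vector, without any file involved. This is a minimal sketch; the vectors `x` and `y` are made-up examples, and `as.is = TRUE` is passed only to avoid factor conversion.

```r
# Every non-NA element parses as a number ("1." is a valid way to write 1),
# so type.convert() converts the whole vector.
x <- c("1.", "2.", NA, "10.")
converted <- type.convert(x, as.is = TRUE)
print(class(converted))

# A single element that does not parse as a number keeps the vector character.
y <- c("1.", "2.", "ten")
kept <- type.convert(y, as.is = TRUE)
print(class(kept))
```

This is why the column in the question is affected: all of its values happen to look like numbers.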

The relevant code in read.table is:

...
     do[1L] <- FALSE
    for (i in (1L:cols)[do]) {
        data[[i]] <- if (is.na(colClasses[i])) 
            type.convert(data[[i]], as.is = as.is[i], dec = dec, 
                numerals = numerals, na.strings = character(0L))
        else if (colClasses[i] == "factor") 
            as.factor(data[[i]])
        else if (colClasses[i] == "Date") 
            as.Date(data[[i]])
        else if (colClasses[i] == "POSIXct") 
            as.POSIXct(data[[i]])
        else methods::as(data[[i]], colClasses[i])
    }
...

Specifying colClasses explicitly skips that automatic conversion:

df <- read.table(file = "df.csv",
                 header = TRUE,
                 sep = "\t",
                 na.strings = "NA",
                 quote="\"",
                 fileEncoding = "UTF-8",
                 colClasses = c("integer", "numeric", "character")
                 )

-checking the structure

str(df)
'data.frame':   10 obs. of  3 variables:
 $ integers: int  1 2 3 4 5 6 7 8 NA 10
 $ doubles : num  1.1 2.1 3.1 4.1 5.1 6.1 7.1 NA 9.1 10.1
 $ strings : chr  "1." "2." "3." "4." ...
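
If only one column needs protecting, colClasses can also be given as a named vector. Per ?read.table, the names are matched to the column names and the unnamed columns are left to automatic conversion. The sketch below uses inline text (`txt`, a made-up three-row sample) in place of df.csv so that it runs on its own.

```r
# Inline tab-separated data standing in for df.csv (hypothetical sample).
txt <- "integers\tdoubles\tstrings\n1\t1.1\t\"1.\"\n2\t2.1\t\"2.\"\nNA\tNA\t\"3.\""

# Only the "strings" column is forced to character; the other two columns
# still go through type.convert().
df2 <- read.table(text = txt,
                  header = TRUE,
                  sep = "\t",
                  na.strings = "NA",
                  quote = "\"",
                  colClasses = c(strings = "character"))
str(df2)
```

This avoids having to list a class for every column when the file is wide.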