Error encountered when using rbindlist: column 25 of result is determined to be integer64 but maxType == 'Character' != REALSXP

Question

I am using the following function to merge all the .csv files in a directory into a single data frame:

library(data.table) #Provides fread and rbindlist
multmerge = function(mypath){
  filenames = list.files(path = mypath, full.names = TRUE)
  rbindlist(lapply(filenames, fread), fill = TRUE)
}

dataframe = multmerge(path)

This code produces the following error:

Error in rbindlist(lapply(filenames, fread), fill = TRUE) : Internal error: column 25 of result is determined to be integer64 but maxType=='character' != REALSXP

The code has worked on the same csv files before... I am not sure what has changed or what the error message means.

Answer 1

Looking at the documentation for fread, I just noticed that there is an integer64 option, so are you dealing with integers larger than 2^31?
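To make the integer64 option concrete, here is a minimal sketch. It assumes a small temporary file with a hypothetical `id` column whose values exceed the 32-bit integer range; the file and column names are illustrative only and are not taken from the question. By default fread should read such a column as integer64, and the `integer64` argument can request character or double instead. Printing integer64 values properly also requires the bit64 package.

```r
library(data.table)

# Hypothetical sample file: `id` holds values above 2^31 - 1
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value", "3000000000,a", "4000000000,b"), tmp)

default_read   <- fread(tmp)                           # default: integer64
character_read <- fread(tmp, integer64 = "character")  # read as character
double_read    <- fread(tmp, integer64 = "double")     # read as double

# Inspect the class fread chose for the same column under each setting
class(default_read$id)
class(character_read$id)
class(double_read$id)
```

If one file's column 25 is read as integer64 and another file's as character, rbindlist has to reconcile the two types, which is where an error like the one above can surface.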

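Edit: I have added tryCatch, which prints a formatted message to the console indicating which files cause errors, along with the actual error message. However, for rbindlist to then run on the normal files, the error handler has to return a dummy list, which produces an extra column called ERROR. That column is NA in every row except the bottom row(s), which hold the name of the problem file as the value.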

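I suggest that after you have run this code through once, you delete the ERROR column and the extra row(s) from the data.table and then save the combined file as a .csv. I would then move all the files that combined properly into a different folder, so that the path holds only the current combined file and the file(s) that did not load properly. Then rerun the function, this time with colClasses specified for the problem column. I have combined everything into one script; hopefully it is not confusing: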

#First run, without colClasses

  multmerge = function(mypath){
    filenames = list.files(path = mypath, full.names = TRUE)
    rbindlist(lapply(filenames, function(i) tryCatch(fread(i),
      error = function(e) {
        cat("\nError reading in file:", i, "\t") #Identifies problem files by name
        message(conditionMessage(e)) #Prints error message without stopping loop
        list(ERROR = i) #Adds a placeholder column so rbindlist will execute
      })), #End of tryCatch and lapply
      fill = TRUE) #rbindlist arguments
  } #End of function

 #You should get the original error message and identify the filename.
  dataframe = multmerge(path)
 #Delete the placeholder column and the extra rows.
 #You will get as many extra rows as you have problem files -
 #most likely just the one with column 25, or any others that had the same issue with column 25.
 #(Note: the out-of-bounds error message will probably go away with the colClasses argument pulled out.)

 #Save this cleaned file with something like: fwrite(dataframe,"CurrentCombinedData.csv")
 #Move all files except the problem file(s) into a new folder.
 #Now you should only have the big combined file and the problem file(s) in your path.
 #Rerun the function, but add the colClasses argument this time.

#Second run, to accommodate the problem file(s). We know the error is in column 25 this time, but in future you may have to adapt this by adding the appropriate column.

  multmerge = function(mypath){
    filenames = list.files(path = mypath, full.names = TRUE)
    rbindlist(lapply(filenames, function(i) tryCatch(fread(i, colClasses = list(character = c(25))),
      error = function(e) {
        cat("\nError reading in file:", i, "\t") #Identifies problem files by name
        message(conditionMessage(e)) #Prints error message without stopping loop
        list(ERROR = i) #Adds a placeholder column so rbindlist will execute
      })), #End of tryCatch and lapply
      fill = TRUE) #rbindlist arguments
  } #End of function

   dataframe2 = multmerge(path)

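Now that we know the source of the error is column 25, we can specify it in colClasses. If you run the code and get the same error message for a different column, simply add that column's number after the 25. Once the data frame has been read in, I would check what is going on in that column (or in any others, if you had to add more). There may be a data entry error in one of the files, or a different encoding of NA values. That is why I say to convert the column to character first: you lose less information than you would by converting it to numeric first.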
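One way to check which file disagrees on a column's type is to read each file separately and compare the class fread infers. The sketch below is only an illustration: `column_classes` is a hypothetical helper, not part of data.table, and the two sample files stand in for the real directory, with column 2 playing the role of column 25.

```r
library(data.table)

# Hypothetical helper: report the class fread infers for one column in each file
column_classes <- function(filenames, col) {
  sapply(filenames, function(f) {
    dt <- fread(f)
    # Files with fewer columns than `col` are reported as NA
    if (ncol(dt) < col) NA_character_ else class(dt[[col]])[1]
  })
}

# Two small sample files standing in for the real directory
dir <- tempfile("csvs")
dir.create(dir)
writeLines(c("a,b", "1,2", "3,4"), file.path(dir, "ok.csv"))
writeLines(c("a,b", "1,x", "3,y"), file.path(dir, "odd.csv"))

files <- list.files(dir, full.names = TRUE)
classes <- column_classes(files, col = 2)
print(classes)
```

Any file whose class differs from the rest is a candidate for a data entry error or an unusual NA encoding in that column.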

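Once you have no errors, always write the cleaned, combined data.table to a csv kept in the folder, and always move the individual files that have already been combined into a different folder. That way, when you add new files, you will only be combining the big file with a few others, which makes it easier to see what is going on in future. Just keep notes on which files gave you trouble and in which columns. Does that make sense?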
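The cleanup step can be sketched roughly as follows. The small `combined` table here is a stand-in for the result of the first run, built with the same `list(ERROR = ...)` placeholder that the error handler returns; the file name `bad_file.csv` and the temporary output folder are assumptions for illustration.

```r
library(data.table)

# Stand-in for the combined result of the first run: two real rows plus
# one placeholder row of the kind produced by the tryCatch error handler
combined <- rbindlist(
  list(data.table(x = 1:2, y = c("a", "b")),
       list(ERROR = "bad_file.csv")),
  fill = TRUE
)

# Record which files failed, then drop the placeholder rows and column
problem_files <- combined[!is.na(ERROR), ERROR]
cleaned <- combined[is.na(ERROR)]
cleaned[, ERROR := NULL]

# Save the cleaned table (the file name follows the script's suggestion)
out_dir <- tempfile("active")
dir.create(out_dir)
fwrite(cleaned, file.path(out_dir, "CurrentCombinedData.csv"))
```

Keeping `problem_files` around gives you the list of files to leave in the active path for the second run; the files that loaded cleanly can then be moved to an archive folder, for example with base R's `file.rename`.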

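Because files are often so idiosyncratic, you will have to be flexible, but this workflow should make it easy to identify problem files and to add whatever is needed to the fread call to make it work. Basically, archive the files that have been processed, keep track of exceptions such as column 25, and keep the most current combined file together with the unprocessed files in the active path. Hope that helps, and good luck!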


Tags: r, data.table, rbindlist