Parsing CSV in Groovy with exception tolerance

Question

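I've been trying to parse a CSV file in Groovy, currently using the library org.apache.commons.csv 2.4. My requirement is this: when a CSV cell contains an invalid value, such as an invalid character, instead of throwing an exception at the first invalid row/cell, I want to collect those exceptions and keep iterating through the file until the end, so that I end up with a complete list of the invalid data in the file.

To that end I've tried several ways of using this Apache library, but unfortunately the iterator aborts as soon as it iterates with CSVParser.getNextRecord().

The code looks like this: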

    def records = new CSVParser(reader, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces())

    // the iterator() inside CSVParser always uses getNextRecord() for its next()
    // implementation, and it may throw an exception on an invalid char
    records.each { record ->
        // if the exception is thrown from .each itself, the try/catch below is in vain
        try {
            // process the record
        } catch (e) {
            // want to collect errors here
        }
    }

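So, is there anything else in this library I should dig into? Or can anyone point me to another, more workable solution? Many thanks, everyone!

Update: CSV sample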

"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1001","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"

The second data row contains the invalid character ", which causes the parser to throw an exception.

Answer 1

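The problem is that if the CSV file contains invalid data, i.e. data that violates the CSV format rules, the parser cannot... parse it. That is why it cannot reliably continue past the first error it encounters.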
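
One way to work around this is to give the parser one physical line at a time, so that a malformed line only fails its own parse and the error can be collected. The following is only a rough sketch, not a definitive implementation: it assumes commons-csv 1.2, assumes that no quoted field spans more than one line, and uses made-up sample lines and variable names (lines, good, errors).

```groovy
@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.CSVFormat
import org.apache.commons.csv.CSVParser

// Hypothetical sample data: the second line has a stray quote inside a quoted field.
def lines = [
    '"1001","RE-01768-011","Opex - To present a paper on Career con","X"',
    '"1002","RE-01768-011","Opex - To present a paper on "Career con","X"',
    '"1003","RE-01768-011","Opex - To present a paper on Career con","X"'
]

// No withHeader() here: every line is parsed as an independent one-record document.
def format = CSVFormat.EXCEL.withIgnoreSurroundingSpaces()
def good = []     // records that parsed cleanly
def errors = []   // one entry per line that failed to parse

lines.eachWithIndex { line, idx ->
    try {
        good.addAll(CSVParser.parse(line, format).getRecords())
    } catch (Exception e) {
        // Remember the 1-based line number and the parser's message, then carry on.
        errors << [line: idx + 1, message: e.message]
    }
}

errors.each { println "line ${it.line}: ${it.message}" }
```

For a real file the lines could come from reader.eachLine instead of a list. The trade-off is that a legitimately quoted field containing a line break would be reported as an error, because each line is parsed in isolation.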

Answer 2

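The problem you're running into is that one of the characters in a cell is the quote character the parser uses under the chosen format, CSVFormat.EXCEL.

The quote character is

the character used to encapsulate values containing special characters

So in your example the quote is misused, and the parser complains about it.

You can work around this by using a different CSVFormat, for example one with no quote character: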

@Grapes(
    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
)

import java.nio.charset.*
import org.apache.commons.csv.*

def text = '''"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1002","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
"1003","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"'''

def parsed = CSVParser.parse(text, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces().withQuote(null))

parsed.getRecords().each {
    println it.toMap().values()
}

The above yields:

[]
["0000016400", "1001", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
["0000016400", "1002", "RE-01768-011", "Opex - To present a paper on "Career con", "X", "PR00031497"]
["0000016400", "1003", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]

Of course, with the workaround above the quotes (") remain in every field.

If you want, you can strip them all:

parsed.getRecords().each {
    println it.toMap().values().collect({ it.replace('"', '') })
}
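
Since the original goal was to collect every invalid cell, the same quote-free format can also be used to check the raw fields instead of just cleaning them. This is a sketch under assumptions, not a complete validator: it assumes commons-csv 1.2, treats a field as invalid when it still contains a quote after removing the surrounding pair and any doubled ("") quotes, and uses shortened, made-up sample data and names (errors, row, column).

```groovy
@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.CSVFormat
import org.apache.commons.csv.CSVParser

// Hypothetical, shortened sample: data row 2 has a stray quote in its second field.
def text = '''"Code","Description","Status"
"1001","To present a paper on Career con","X"
"1002","To present a paper on "Career con","X"
"1003","To present a paper on Career con","X"'''

// No quote character: the parser splits on commas only and keeps the raw quotes.
def format = CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces().withQuote(null)
def errors = []   // one entry per suspicious cell

def records = CSVParser.parse(text, format).getRecords()
records.eachWithIndex { record, row ->
    record.eachWithIndex { raw, col ->
        // Strip one pair of surrounding quotes, if present.
        boolean wrapped = raw.length() >= 2 && raw.startsWith('"') && raw.endsWith('"')
        String inner = wrapped ? raw.substring(1, raw.length() - 1) : raw
        // Doubled quotes are a legal escape; any other quote left over is suspicious.
        if (inner.replace('""', '').contains('"')) {
            errors << [row: row + 1, column: col + 1, value: raw]
        }
    }
}

errors.each { println "data row ${it.row}, column ${it.column}: ${it.value}" }
```

One caveat: without a quote character, a comma inside a quoted field also splits that field, so such cells would be flagged too. This check only suits data whose fields never contain the delimiter.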
