Parsing CSV in groovy with exception tolerance

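I've been trying to parse a CSV file in Groovy, currently using the library org.apache.commons.csv 2.4. My requirement is this: when CSV cells contain invalid data values, such as invalid characters, I don't want an exception thrown at the first invalid row/cell. Instead I want to collect these exceptions and keep iterating over the CSV file until its end, so that I end up with a complete list of the invalid data the file contains.

To this end I've tried several ways of using this Apache library, but unfortunately, as soon as the iteration goes through CSVParser.getNextRecord(), the iterator aborts.

The code looks like this: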

    def records = new CSVParser(reader, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces())

    // at this line, the iterator() inside CSVParser always uses getNextRecord() for its next() implementation, and it may throw an exception on an invalid char
    records.each { record ->
        // if the exception is thrown from .each itself, the try/catch below is in vain
        try {

        } catch (e) {
            // want to collect errors here
        }
    }

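So, is there anything else in this library I should dig into? Or can anyone point me to another, more feasible solution? Many thanks, everyone!

Update: CSV sample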

"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1001","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"

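The second data row contains an invalid character (") that causes the parser to throw an exception.

The problem is that if the CSV file contains invalid data, i.e. data that violates the CSV format rules, then the parser can't... parse. That is why it cannot reliably parse past the first error it encounters.

The issue you're running into is that one of the characters in a cell is the character the parser uses as the quote character under the selected format, CSVFormat.EXCEL.

Quote character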

the character used to encapsulate values containing special characters

So in your example the quote character is misused, and the parser complains about it.
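For comparison, here is a minimal sketch of what the parser expects instead: a quote inside a quoted field is conventionally escaped by doubling it. The sample line is a shortened, hand-edited version of the row from the question (an assumption for illustration, not the asker's real data), and the library version simply mirrors the one used in this answer.

```groovy
@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.CSVFormat
import org.apache.commons.csv.CSVParser

// Shortened sample row; the embedded quote is written as "" (doubled).
def line = '"1001","Opex - To present a paper on ""Career con","X"'

// Parse the single line with the same base format as in the question.
def record = CSVParser.parse(line, CSVFormat.EXCEL).records[0]

// CSVRecord is Iterable<String>, so collect its values into a plain list.
def fields = record.collect { it }
println fields[1]
```

If the file can be fixed at the source, writing embedded quotes this way lets the field come back with a single literal quote in it, and no format change is needed.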

You can work around this by using a different CSVFormat. For example, one without a quote character:

@Grapes(
    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
)

import java.nio.charset.*
import org.apache.commons.csv.*

def text = '''"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"

"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1002","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
"1003","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"'''

def parsed = CSVParser.parse(text, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces().withQuote(null))

parsed.getRecords().each {
    println it.toMap().values()
}

The above yields:

[]
["0000016400", "1001", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
["0000016400", "1002", "RE-01768-011", "Opex - To present a paper on "Career con", "X", "PR00031497"]
["0000016400", "1003", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]

Of course, with the workaround above, the quotes (") end up included in every field.

If you want, you can strip them all out:

parsed.getRecords().each {
    println it.toMap().values().collect({ it.replace('"', '') })
}
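If the goal is still to get a list of every malformed row rather than to relax the format, one possible approach is to parse each physical line with its own parser, so a failure on one line cannot abort the rest. The sketch below is only that, a sketch: `parseTolerantly` is a hypothetical helper name, it assumes no field contains an embedded line break, it treats the header as an ordinary row, and it catches `Exception` broadly since the exact exception type may differ between library versions.

```groovy
@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.CSVFormat
import org.apache.commons.csv.CSVParser

// Hypothetical helper: parses line by line and collects failures instead of
// stopping at the first one. Assumes one record per physical line.
def parseTolerantly(String text, CSVFormat format = CSVFormat.EXCEL.withIgnoreSurroundingSpaces()) {
    def rows = []     // field lists of the lines that parsed cleanly
    def errors = []   // one [line: n, message: ...] map per failing line
    text.eachLine { String line, int index ->
        if (line.trim().isEmpty()) return          // skip blank lines
        try {
            def parser = CSVParser.parse(line, format)
            parser.records.each { record -> rows << record.collect { it } }
            parser.close()
        } catch (Exception e) {
            // Record the 1-based line number and keep going.
            errors << [line: index + 1, message: e.message]
        }
    }
    [rows: rows, errors: errors]
}

def sample = '''"a","b"
"1","x "y"
"2","z"'''

def result = parseTolerantly(sample)
result.errors.each { println "line ${it.line}: ${it.message}" }
```

The first entry in `rows` is the header line, so the caller can pair it with the later rows if named columns are needed. The trade-off is that quoted fields spanning several lines are not supported by this approach.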