Parsing CSV in groovy with exception tolerance
I have been trying to parse a CSV file in Groovy, currently using the library org.apache.commons.csv 2.4. My requirement is: when a CSV cell contains invalid data, such as an invalid character, instead of throwing an exception at the first invalid row/cell I want to collect those exceptions and keep iterating to the end of the file, so that I end up with a complete list of all the invalid data the CSV contains.
For this purpose I have tried several approaches with this Apache library, but unfortunately, as long as it uses CSVParser.getNextRecord() for iteration, the iteration aborts.
The code goes like this:
import org.apache.commons.csv.*

def records = new CSVParser(reader, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces())
// At this point the iterator() inside CSVParser always uses getNextRecord() for its
// next() implementation, and that call may throw an exception on an invalid character.
records.each { record ->
    // If the exception is thrown from .each itself, the try/catch below is in vain.
    try {
        // process the record
    } catch (e) {
        // want to collect the errors here
    }
}
So, is there something else in this library that I should dig into? Or can anyone point me to another, more viable solution? Many thanks, everyone!
Update:
CSV sample:
"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"
"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1001","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
The second data row contains an invalid character, ", which causes the parser to throw an exception.
The problem is that if a CSV file contains invalid data, i.e. data that breaks the rules of the CSV format, then the parser cannot... well, parse. That is why it cannot reliably get past the first error it encounters.
The problem you are running into is that one of the characters in a cell is the quote character that the parser uses for the chosen format, CSVFormat.EXCEL. The quote character is
the character used to encapsulate values containing special characters
so in your sample the quoting is misused, and the parser complains about it.
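As an aside, and only as an illustrative sketch that is not part of the original answer: if the file could be fixed at the source instead, the stray quote would have to be doubled, which is the escaping the EXCEL format expects. A trimmed-down, hypothetical two-column example:

@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.*

// The problem value with its embedded quote doubled parses cleanly under the
// strict EXCEL format; the two-column layout here is made up for brevity.
def fixed = '''"WBS Element","PS: Short description (1st text line)"
"RE-01768-011","Opex - To present a paper on ""Career con"'''

CSVParser.parse(fixed, CSVFormat.EXCEL.withHeader()).records.each {
    println it.toMap()
}

The printed map contains the embedded quote as a single literal character in the value, with no exception thrown.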
You can work around this by using a different CSVFormat, for example one with no quote character:
@Grapes(
    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
)

import java.nio.charset.*
import org.apache.commons.csv.*

def text = '''"Company code for WBS element","WBS Element","PS: Short description (1st text line)","Responsible Cost Center for WBS Element","OBJNR","WBS Status"
"1001","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"
"1002","RE-01768-011","Opex - To present a paper on "Career con","0000016400","PR00031497","X"
"1003","RE-01768-011","Opex - To present a paper on Career con","0000016400","PR00031497","X"'''

def parsed = CSVParser.parse(text, CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces().withQuote(null))
parsed.getRecords().each {
    println it.toMap().values()
}
The above results in:
["0000016400", "1001", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
["0000016400", "1002", "RE-01768-011", "Opex - To present a paper on "Career con", "X", "PR00031497"]
["0000016400", "1003", "RE-01768-011", "Opex - To present a paper on Career con", "X", "PR00031497"]
Of course, with the above workaround you end up with the quotes (") included in every field.
If you want, you can strip them all out:
parsed.getRecords().each {
    println it.toMap().values().collect { it.replace('"', '') }
}
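If the original requirement of collecting every invalid row, rather than just parsing around the stray quotes, still matters, one possible approach (not part of the answer above; it assumes no field contains embedded line breaks and uses a hypothetical file name) is to feed the parser one physical line at a time with the strict format and record the lines that fail:

@Grab(group='org.apache.commons', module='commons-csv', version='1.2')
import org.apache.commons.csv.*

def goodRecords = []
def badLines = []                                    // [line number, error message] pairs

def lines = new File('wbs.csv').readLines()          // hypothetical input file
def header = lines.head()
lines.tail().eachWithIndex { line, i ->
    try {
        // Re-attach the header so withHeader() and toMap() keep working per line.
        def parser = CSVParser.parse(header + '\n' + line,
                CSVFormat.EXCEL.withHeader().withIgnoreSurroundingSpaces())
        goodRecords.addAll(parser.records)
    } catch (Exception e) {
        badLines << [i + 2, e.message]               // +2: 1-based index plus the header row
    }
}

println "Parsed ${goodRecords.size()} rows, found ${badLines.size()} invalid rows"
badLines.each { println it }

This keeps the strict EXCEL validation, so genuinely malformed rows are reported with their line number and the parser's own error message, while the loop still reaches the end of the file; the trade-off is that legitimately quoted multi-line fields can no longer be handled.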