Apache CSV parser with duplicate column headers

Question

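I need to process CSV files with repeated headers. Each data item spans three columns (minimum, maximum, and average), but all three columns have the same header. The first column is the minimum, the second is the average, and the third is the maximum.

The Apache CSV parser throws: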

java.lang.IllegalArgumentException: The header contains a duplicate name:

How can I configure the parser to accept duplicate headers?

Answer 1

CSVParser has no predefined configuration parameter that controls whether duplicate column names are accepted.

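A look at the source code shows that the initializeHeader method builds a Map with the column names as keys and the column indexes as values. To use the header map, the column names must therefore be unique.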
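To illustrate why a name-to-index map cannot represent repeated names, here is a minimal sketch in plain Java. It is not the library's code: the class HeaderMapSketch and the method buildHeaderMap are hypothetical names made up for this example, and the error text only imitates the message from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderMapSketch {
    // Hypothetical helper: maps each column name to its index and
    // rejects a name that is already present, since a Map key can
    // point to only one index.
    public static Map<String, Integer> buildHeaderMap(String[] header) {
        Map<String, Integer> map = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            if (map.containsKey(header[i])) {
                throw new IllegalArgumentException(
                        "The header contains a duplicate name: \"" + header[i] + "\"");
            }
            map.put(header[i], i);
        }
        return map;
    }

    public static void main(String[] args) {
        // Unique names work fine.
        System.out.println(buildHeaderMap(new String[] {"min", "avg", "max"}));
        // Repeated names are rejected.
        try {
            buildHeaderMap(new String[] {"temp", "temp", "temp"});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With three identical names, only one of the three indexes could ever be stored under that key, which is why access by name needs unique names.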

There is a workaround, though:

Specify a CSVFormat that ignores the column names defined in the first line of the CSV file, and define your column names manually.

From the CSVFormat documentation:

Defining column names

To define the column names you want to use to access records, write:
CSVFormat.EXCEL.withHeader("Col1", "Col2", "Col3");
Calling withHeader(String...) lets you use the given names to address values in a CSVRecord, and assumes that your CSV source does not contain a first record that also defines column names. If it does, then you are overriding this metadata with your names and you should skip the first record by calling withSkipHeaderRecord(boolean) with true.
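Applied to the file from the question, the workaround might look like the sketch below. It assumes Apache Commons CSV is on the classpath. The column names "min", "avg" and "max", the sample data, and the class and method names are my own assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class ManualHeaderExample {
    // Reads the second column of every data row under the manually
    // assigned name "avg", ignoring the duplicate names in the file.
    public static List<String> readAverages(String csv) {
        CSVFormat format = CSVFormat.DEFAULT
                .withHeader("min", "avg", "max") // our own unique names
                .withSkipHeaderRecord(true);     // skip the file's header line
        List<String> averages = new ArrayList<>();
        try (CSVParser parser = format.parse(new StringReader(csv))) {
            for (CSVRecord record : parser) {
                averages.add(record.get("avg"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return averages;
    }

    public static void main(String[] args) {
        // Assumed sample input: one header line with a repeated name.
        String csv = "temp,temp,temp\n1,2,3\n4,5,6\n";
        System.out.println(readAverages(csv));
    }
}
```

Because the names are supplied in code, the duplicate names in the file's first line are never used to build the header map. Newer releases of the library may mark the with* methods as deprecated in favor of a builder, so check the version you are using.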

Answer 2

CSVParser can now be configured to allow duplicate headers.

CSVFormat csvFormat = CSVFormat.DEFAULT.withAllowDuplicateHeaderNames();
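Once duplicates are allowed, the name no longer identifies a single column, so the three values are probably best read by index. A minimal sketch, assuming a Commons CSV release that provides withAllowDuplicateHeaderNames(). The sample data and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class DuplicateHeaderExample {
    // Parses rows whose header line repeats the same name three times
    // and returns each row as {min, avg, max}, read by column index.
    public static List<String[]> readTriples(String csv) {
        CSVFormat format = CSVFormat.DEFAULT
                .withAllowDuplicateHeaderNames()
                .withFirstRecordAsHeader(); // first line is a header, not data
        List<String[]> rows = new ArrayList<>();
        try (CSVParser parser = format.parse(new StringReader(csv))) {
            for (CSVRecord record : parser) {
                rows.add(new String[] {record.get(0), record.get(1), record.get(2)});
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : readTriples("temp,temp,temp\n1,2,3\n4,5,6\n")) {
            System.out.println(String.join(" / ", row));
        }
    }
}
```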

csv

apache-commons