Using univocity to parse two different CSV files and write the differences into a new CSV file

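I always use the univocity parsers in my Java programs to compare CSV files. They work well and are fast.

This time, however, I am trying to parse two different large CSV files with complex values and write the differences to a new CSV file.

Following one of the author's examples, I tried using processFile after reading file1 into a list and converting it to a map, but I still get errors while parsing.

Below are my sample inputs and the expected output file.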

Input - File 1

"h1","h2","h3","h4","h5"
"00000","US","9503.00.0089","USA","9503.0089"
"","EU","9503.00.7000","EUROPEAN UNION","9503.00.7000"
"#1200","US","5601.22.0010","USA","5601.22.0010"
"0180691","US","9503.00.0073","USA","9503.00.0073"
"DRTY01","CA","9603.01.0088","CAN","9603.01.0088"

Input - File 2

"h1","h2","h3","h6","h7","h8","h9","h10","h11"
"018890","US","","2015","101","1","1","All",""
"00000","US","9503.00.0090","1986","101","1","1","All","9503.00.0090"
"0180691","US","9503.00.0073","2019","101","1","1","All","9503.00.0073"
"DRTY01","CA","9603.01.0087","2002","102","1","2","CA","9603.01.0087"

Select the rows where h1 and h2 match in file1 and file2, then compare file1's h3 with file2's h3. If the two h3 values are not equal, I want to write "h1","h4","h10","h5","h11","h6","h7","h8","h9" to file 3.

Output - File 3

"h1","h4","h10","h5","h11","h6","h7","h8","h9"
"00000","USA","All","9503.00.0089","9503.00.0090","1986","101","1","1"
"DRTY01","CAN","CA","9603.01.0088","9603.01.0087","2002","102","1","2"

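I have a solution for your problem, but please regression test it. I am assuming that the combination of h1 and h2 is unique within each file. I create a HashMap with a HeaderList object as the key and the entire CSV row as the value. The HeaderList class overrides hashCode and equals as follows:

  • hashCode uses only h1 and h2 to generate the code (because together they are assumed to be unique)
  • equals also uses h3 as a comparison criterion, and returns false when both h3 values are the same.

The logic in equals is: if h1 and h2 are the same in map1 and map2 while h3 differs, give me the rows from map1 and map2. This uses extra space for the maps, but the overall comparison drops to O(N). The code below gives you the rows you want from the maps. I have not done the IO and exception handling properly, so please handle them as needed.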
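The equals described above deliberately bends the usual equals contract (two keys with identical fields are not equal), so it is worth regression testing. As a point of comparison, here is a minimal sketch of the same O(N) idea with a conventional key: index file2 by (h1, h2) and compare h3 explicitly. This is a swapped-in variant, not the code from this answer; the class name `H3MismatchSketch` and the helpers `key` and `findH3Mismatches` are hypothetical, and rows are assumed to be `String[]` with h1, h2, h3 at positions 0, 1, 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class H3MismatchSketch {

    // Hypothetical helper: the lookup key is (h1, h2) only.
    // Arrays.asList tolerates null cells and compares element by element.
    static List<String> key(String[] row) {
        return Arrays.asList(row[0], row[1]);
    }

    // Returns {rowFromFile1, rowFromFile2} pairs that share h1 and h2 but differ in h3.
    static List<String[][]> findH3Mismatches(List<String[]> file1, List<String[]> file2) {
        Map<List<String>, String[]> file2ByKey = new HashMap<>();
        for (String[] row : file2) {
            // Assumes (h1, h2) is unique within a file; a later duplicate overwrites an earlier one.
            file2ByKey.put(key(row), row);
        }
        List<String[][]> mismatches = new ArrayList<>();
        for (String[] row1 : file1) {
            String[] row2 = file2ByKey.get(key(row1));
            if (row2 != null && !Objects.equals(row1[2], row2[2])) {
                mismatches.add(new String[][] {row1, row2});
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        List<String[]> file1 = Arrays.asList(
            new String[] {"00000", "US", "9503.00.0089", "USA", "9503.0089"},
            new String[] {"0180691", "US", "9503.00.0073", "USA", "9503.00.0073"});
        List<String[]> file2 = Arrays.asList(
            new String[] {"00000", "US", "9503.00.0090", "1986", "101", "1", "1", "All", "9503.00.0090"},
            new String[] {"0180691", "US", "9503.00.0073", "2019", "101", "1", "1", "All", "9503.00.0073"});
        for (String[][] pair : findH3Mismatches(file1, file2)) {
            System.out.println(Arrays.toString(pair[0]) + " vs " + Arrays.toString(pair[1]));
        }
    }
}
```

With the two sample rows per file used in `main`, only the "00000" pair is reported, because the "0180691" rows have the same h3.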

Test class:

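import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.Reader;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import com.univocity.parsers.common.processor.RowListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class UnivocityTest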
{

    public static void main(String[] args) throws FileNotFoundException
    {
        // Get data from csv file1
        List<String[]> f1 = getData("example.csv");
        // Get data from csv file2
        List<String[]> f2 = getData("example1.csv");

        // Convert data to a Map with HeaderList class and entire row.
        Map<HeaderList, String[]> map1 = convertAndReturn(f1);
        Map<HeaderList, String[]> map2 = convertAndReturn(f2);

        //Currently prints the required rows.
        compareData(map1, map2);
    }

    // Convert csv to List<String[]>
    private static List<String[]> getData(String file) throws FileNotFoundException
    {
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setLineSeparatorDetectionEnabled(true);
        RowListProcessor rowProcessor = new RowListProcessor();
        parserSettings.setProcessor(rowProcessor);
        parserSettings.setHeaderExtractionEnabled(true);

        CsvParser parser = new CsvParser(parserSettings);
        parser.parse(getReader(file));
        // String[] headers = rowProcessor.getHeaders();
        List<String[]> rows = rowProcessor.getRows();

        return rows;
    }

    // get reader object
    private static Reader getReader(String string) throws FileNotFoundException
    {
        // TODO Add proper file handling and exception handling
        return new FileReader(new File(string));
    }

    // Return HashMap
    private static Map<HeaderList, String[]> convertAndReturn(List<String[]> f1)
    {
        Map<HeaderList, String[]> map = new java.util.HashMap<>();

        for (String[] each : f1)
        {
            // For each row in csv create a corresponding HeaderList object with h1,h2 and h3 as key
            // and row as value.
            HeaderList header = new HeaderList(each[0], each[1], each[2]);
            map.put(header, each);
        }

        return map;
    }

    private static void compareData(Map<HeaderList, String[]> map1, Map<HeaderList, String[]> map2)
    {
        // Iterates over the map1 keys one by one. For each key we check if there is a matching key
        // in map2. The matching condition will be h1 and h2 should be same while h3 should be
        // different. Once a key like that is found currently I'm printing both the rows, here you
        // can get the rows you want from the map and return them.

        for (HeaderList each : map1.keySet())
        {
            if (map2.containsKey(each))
            {
                // TODO Assume you want columns h3, h4 from file1 and h6, h7 from file2.
                // map1 represents file1, with columns h3 and h4 at positions 2 and 3 inside the String[].
                // map2 represents file2, with columns h6 and h7 at positions 3 and 4 inside the String[].
                String h3FromFile1 = map1.get(each)[2];
                String h4FromFile1 = map1.get(each)[3];
                String h6FromFile2 = map2.get(each)[3];
                String h7FromFile2 = map2.get(each)[4];
                System.out.println("Required Columns: ");
                System.out.println("h3 file1: "+ h3FromFile1);
                System.out.println("h4 file1: "+ h4FromFile1);
                System.out.println("h6 file2: "+ h6FromFile2);
                System.out.println("h7 file2: " + h7FromFile2);
                System.out.println(Arrays.toString(map1.get(each)));
                System.out.println(Arrays.toString(map2.get(each)));
                System.out.println("-------------------------------");
            }
        }
    }

}
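The test class above only prints a few columns. To get the row layout the question asks for in file 3 (h1, h4, h10, h5, h11, h6, h7, h8, h9), the matched pair of rows can be rearranged by position. The following is a minimal sketch under the column positions shown in the sample files; `OutputRowSketch`, `toOutputRow` and `toCsvLine` are hypothetical names, and the quoting helper is a plain-Java stand-in for illustration (for real output, univocity's own CsvWriter would be the more natural choice).

```java
import java.util.StringJoiner;

public class OutputRowSketch {

    // file1 columns: h1=0, h2=1, h3=2, h4=3, h5=4
    // file2 columns: h1=0, h2=1, h3=2, h6=3, h7=4, h8=5, h9=6, h10=7, h11=8
    static String[] toOutputRow(String[] row1, String[] row2) {
        return new String[] {
            row1[0], // h1
            row1[3], // h4
            row2[7], // h10
            row1[4], // h5
            row2[8], // h11
            row2[3], // h6
            row2[4], // h7
            row2[5], // h8
            row2[6]  // h9
        };
    }

    // Quotes every field and doubles embedded quotes; null cells become empty strings.
    static String toCsvLine(String[] values) {
        StringJoiner line = new StringJoiner(",");
        for (String value : values) {
            String cell = (value == null) ? "" : value;
            line.add("\"" + cell.replace("\"", "\"\"") + "\"");
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] row1 = {"DRTY01", "CA", "9603.01.0088", "CAN", "9603.01.0088"};
        String[] row2 = {"DRTY01", "CA", "9603.01.0087", "2002", "102", "1", "2", "CA", "9603.01.0087"};
        System.out.println(toCsvLine(toOutputRow(row1, row2)));
    }
}
```

For the DRTY01 pair this produces the same field order as the second data row of the expected file 3 in the question.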

The bean class that holds the three columns h1, h2, and h3:

class HeaderList
        {

            private String h1;

            private String h2;

            private String h3;

            public HeaderList(String h1, String h2, String h3)
            {
                super();
                this.h1 = h1;
                this.h2 = h2;
                this.h3 = h3;
            }

            /**
             * Generates the hash code from h1 and h2 only, so rows that share h1 and h2 get the
             * same hash regardless of h3.
             *
             * @inheritDoc
             */
            @Override
            public int hashCode()
            {
                final int prime = 31;
                int result = 1;
                result = prime * result + ((h1 == null) ? 0 : h1.hashCode());
                result = prime * result + ((h2 == null) ? 0 : h2.hashCode());
                return result;
            }

            /**
             * The equals method assumes each csv file row is uniquely identified by h1 and h2
             * combined. If h1 and h2 do not uniquely identify a row, entries overwrite each other
             * in the map and data may be lost. For h3 the check is inverted: we return true only
             * when the two h3 values differ.
             *
             * @inheritDoc
             */
            @Override
            public boolean equals(Object obj)
            {
                if (this == obj)
                    return true;
                if (obj == null)
                    return false;
                if (getClass() != obj.getClass())
                    return false;
                HeaderList other = (HeaderList) obj;
                if (h1 == null)
                {
                    if (other.h1 != null)
                        return false;
                }
                else if (!h1.equals(other.h1))
                    return false;
                if (h2 == null)
                {
                    if (other.h2 != null)
                        return false;
                }
                else if (!h2.equals(other.h2))
                    return false;
                if (h3 == null)
                {
                    if (other.h3 == null)
                        return false;
                }
                else if (h3.equals(other.h3))
                    return false;
                return true;
            }

            /**
             * @inheritDoc
             */
            @Override
            public String toString()
            {
                return "HeaderList [h1=" + h1 + ", h2=" + h2 + ", h3=" + h3 + "]";
            }

        }
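One consequence of this equals is worth seeing in isolation: a key is found in the map only when the probe has the same h1 and h2 and a different h3, so a field-for-field copy of a stored key is not found. The sketch below reproduces that behavior with a condensed stand-in class; `DiffKey` and `EqualsCaveatDemo` are hypothetical names, and `Objects.hash` is used for brevity instead of the hand-written hash above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class EqualsCaveatDemo {

    // Condensed stand-in for HeaderList with the same comparison rules.
    static final class DiffKey {
        final String h1;
        final String h2;
        final String h3;

        DiffKey(String h1, String h2, String h3) {
            this.h1 = h1;
            this.h2 = h2;
            this.h3 = h3;
        }

        @Override
        public int hashCode() {
            return Objects.hash(h1, h2); // h3 is intentionally left out
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof DiffKey)) return false;
            DiffKey other = (DiffKey) obj;
            // "Equal" means: same h1 and h2, but a different h3.
            return Objects.equals(h1, other.h1)
                && Objects.equals(h2, other.h2)
                && !Objects.equals(h3, other.h3);
        }
    }

    public static void main(String[] args) {
        Map<DiffKey, String> map = new HashMap<>();
        map.put(new DiffKey("00000", "US", "9503.00.0089"), "row from file1");

        // Same h1/h2, different h3: treated as a match.
        System.out.println(map.containsKey(new DiffKey("00000", "US", "9503.00.0090")));
        // Same h1/h2/h3 but a separate object: not treated as a match.
        System.out.println(map.containsKey(new DiffKey("00000", "US", "9503.00.0089")));
    }
}
```

That is the intended behavior for finding differences, but it also means the same map cannot be used for ordinary lookups by value.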

Output for the given input CSV files:

Required Columns: 
h3 file1: 9603.01.0088
h4 file1: CAN
h6 file2: 2002
h7 file2: 102
[DRTY01, CA, 9603.01.0088, CAN, 9603.01.0088]
[DRTY01, CA, 9603.01.0087, 2002, 102, 1, 2, CA, 9603.01.0087]
-------------------------------
Required Columns: 
h3 file1: 9503.00.0089
h4 file1: USA
h6 file2: 1986
h7 file2: 101
[00000, US, 9503.00.0089, USA, 9503.0089]
[00000, US, 9503.00.0090, 1986, 101, 1, 1, All, 9503.00.0090]
-------------------------------