Handling "", "-" CSV with Univocity
Any idea how to get the correct lines? Some lines are getting stuck together, and I don't know how to stop it or why it happens.
col. 0: Date
col. 1: Col2
col. 2: Col3
col. 3: Col4
col. 4: Col5
col. 5: Col6
col. 6: Col7
col. 7: Col7
col. 8: Col8
col. 0: 2017-05-23
col. 1: String
col. 2: lo rem ipsum
col. 3: dolor sit amet
col. 4: mcdonalds.com/online.html
col. 5: null
col. 6: "","-""-""2017-05-23"
col. 7: String
col. 8: lo rem ipsum
col. 9: dolor sit amet
col. 10: burgerking.com
col. 11: https://burgerking.com/
col. 12: 20
col. 13: 2
col. 14: fake
col. 0: 2017-05-23
col. 1: String
col. 2: lo rem ipsum
col. 3: dolor sit amet
col. 4: wendys.com
col. 5: null
col. 6: "","-""-""2017-05-23"
col. 7: String
col. 8: lo rem ipsum
col. 9: dolor sit amet
col. 10: buggagump.com
col. 11: null
col. 12: "","-""-""2017-05-23"
col. 13: String
col. 14: cheese
col. 15: ad eum
col. 16: mcdonalds.com/online.html
col. 17: null
col. 18: "","-""-""2017-05-23"
col. 19: String
col. 20: burger
col. 21: ludus dissentiet
col. 22: www.mcdonalds.com
col. 23: https://www.mcdonalds.com/
col. 24: 25
col. 25: 3
col. 26: fake
col. 0: 2017-05-23
col. 1: String
col. 2: wine
col. 3: id erat utamur
col. 4: bubbagump.com
col. 5: https://buggagump.com/
col. 6: 25
col. 7: 3
col. 8: fake
done
CSV example (the \r\n may have been corrupted when copy/pasting). Available here: https://www.dropbox.com/s/86klza4qok4ty2s/malformed%20csv%20r%20n%20small.csv?dl=0
"Date","Col2","Col3","Col4","Col5","Col6","Col7","Col7","Col8"
"2017-05-23","String","lo rem ipsum","dolor sit amet","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","burgerking.com","https://burgerking.com/","20","2","fake"
"2017-05-23","String","lo rem ipsum","dolor sit amet","wendys.com","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","buggagump.com","","-","-","-"
"2017-05-23","String","cheese","ad eum","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","burger","ludus dissentiet","www.mcdonalds.com","https://www.mcdonalds.com/","25","3","fake"
"2017-05-23","String","wine","id erat utamur","bubbagump.com","https://buggagump.com/","25","3","fake"
Building the settings:
CsvParserSettings settings = new CsvParserSettings();
settings.setDelimiterDetectionEnabled(true);
settings.setQuoteDetectionEnabled(true);
settings.setLineSeparatorDetectionEnabled(false); // all the same using `true`
settings.getFormat().setLineSeparator("\r\n");
CsvParser parser = new CsvParser(settings);
List<String[]> rows;
rows = parser.parseAll(getReader("testFiles/" + "malformed csv small.csv"));
for (String[] row : rows)
{
    System.out.println("");
    int i = 0;
    for (String element : row)
    {
        System.out.println("col. " + i++ + ": " + element);
    }
}
System.out.println("done");
Since you are testing the auto-detection process, I suggest printing out the detected format:
CsvFormat format = parser.getDetectedFormat();
System.out.println(format);
This will print out:
CsvFormat:
Comment character=#
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\r\n
Quote character="
Quote escape character=-
Quote escape escape character=null
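As you can see, the parser did not detect the quote escape correctly. While the format detection process is usually very good, there is no guarantee it will always get it right, especially with small test samples. In your sample, I can't see why it would pick '-' as the escape character, so I opened an issue to investigate and see what makes it detect that character.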
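To illustrate why a wrong quote escape can glue rows together, here is a toy tokenizer written with the JDK only. It is a simplified sketch, not univocity's actual implementation, and the names QuoteEscapeDemo, SAMPLE and parse are made up for this illustration. It assumes a comma delimiter and treats an escape character followed by a quote as a literal quote inside the field.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteEscapeDemo {

    // Two short rows shaped like the sample file: the first one ends in "-" fields.
    static final String SAMPLE =
            "\"a\",\"\",\"-\",\"-\",\"-\"\r\n" +
            "\"b\",\"x\",\"20\",\"2\",\"fake\"\r\n";

    /** Splits comma-separated text into records, honouring the given quote and quote escape. */
    static List<List<String>> parse(String text, char quote, char escape) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                boolean hasNext = i + 1 < text.length();
                if (c == escape && hasNext && text.charAt(i + 1) == quote) {
                    field.append(quote); // escaped quote: stays inside the field
                    i++;
                } else if (c == quote) {
                    inQuotes = false;    // closing quote
                } else {
                    field.append(c);     // includes commas and line breaks
                }
            } else if (c == quote) {
                inQuotes = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString()); // flush whatever is left at end of input
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println("escape '\"': " + parse(SAMPLE, '"', '"').size() + " record(s)");
        System.out.println("escape '-': " + parse(SAMPLE, '"', '-').size() + " record(s)");
    }
}
```

With '-' as the escape, the sequence -" inside a "-" field no longer closes the quote, so the line break that follows is swallowed into the field and the second call returns fewer records than the first.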
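If you know that none of your input files will ever have '-' as the quote escape, a workaround you can use right now is to detect the format, test what it picked up from the input, and then parse the content, like this: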
public List<String[]> parse(File input, CsvFormat format) {
    CsvParserSettings settings = new CsvParserSettings();
    if (format == null) { //no format specified? Let's detect what we are dealing with
        settings.detectFormatAutomatically();
        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(input); //just call beginParsing to kick off the auto-detection process
        format = parser.getDetectedFormat(); //capture the format
        parser.stopParsing(); //stop the parser - no need to read anything yet.
        System.out.println(format);
        if (format.getQuoteEscape() == '-') { //got something weird detected? Let's amend it.
            format.setQuoteEscape('"');
        }
        return parse(input, format); //now parse with the intended format
    } else {
        settings.setFormat(format); //this parses with the format adjusted earlier.
        CsvParser parser = new CsvParser(settings);
        return parser.parseAll(input);
    }
}
Now just call the parse method:
List<String[]> rows = parse(new File("/Users/jbax/Downloads/malformed csv r n small.csv"), null);
You will then get the data extracted correctly. Hope this helps!
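As a sanity check after parsing, one could also verify that every row has as many columns as the header, which would flag glued rows like the ones in the question. A minimal sketch using only the JDK; the class and method names are hypothetical, and the sample rows are shortened stand-ins for real parser output:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowWidthCheck {

    /** Returns the indexes of rows whose column count differs from the first (header) row. */
    static List<Integer> misalignedRows(List<String[]> rows) {
        List<Integer> bad = new ArrayList<>();
        if (rows.isEmpty()) {
            return bad;
        }
        int expected = rows.get(0).length;
        for (int i = 1; i < rows.size(); i++) {
            if (rows.get(i).length != expected) {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"Date", "Col2", "Col3"},
                new String[] {"2017-05-23", "String", "wine"},
                // a glued row: two records merged into one
                new String[] {"2017-05-23", "String", "cheese", "2017-05-23", "String"});
        System.out.println("Misaligned row indexes: " + misalignedRows(rows));
    }
}
```

An empty result means every row matches the header width; any index it returns points at a row worth inspecting before trusting the detected format.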