@Validate not skipping invalid rows when used with CsvRoutines in UniVocity parser
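I'm using univocity-parsers version 2.7.3. I have a CSV file with 1 million records, and it may grow in the future. I only read a few specific columns from the file, and these are my requirements:
Do not store the CSV content in memory at any point
Ignore/skip creating the bean if either the latitude or longitude column in the CSV is null/blank
To meet these requirements, I tried using CsvRoutines so that the CSV data is not copied into memory. I put the @Validate annotation on both the "Latitude" and "Longitude" fields, and I use an error handler that does not rethrow any exception, so that a record is skipped when validation fails.
Sample CSV: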
#version:1.0
#timestamp:2017-05-29T23:22:22.320Z
#brand:test report
network_name,location_name,location_category,location_address,location_zipcode,location_phone_number,location_latitude,location_longitude,location_city,location_state_name,location_state_abbreviation,location_country,location_country_code,pricing_type,wep_key
"1 Free WiFi","Test Restaurant","Cafe / Restaurant","Marktplatz 18","1233","+41 263 34 05","1212.15","7.51","Basel","test","BE","India","DE","premium",""
"2 Free WiFi","Test Restaurant","Cafe / Restaurant","Zufikerstrasse 1","1111","+41 631 60 00","11.354","8.12","Bremgarten","test","AG","China","CH","premium",""
"3 Free WiFi","Test Restaurant","Cafe / Restaurant","Chemin de la Fontaine 10","1260","+41 22 361 69","12.34","11.23","Nyon","Vaud","VD","Switzerland","CH","premium",""
"!.oist*~","HoistGroup Office","Office","Chemin de I Etang","CH-1211","","","","test","test","GE","Switzerland","CH","premium",""
"test","tess's Takashiro","Cafe / Restaurant","Test 1-10","870-01","097-55-1808","","","Oita","Oita","OITA","Japan","JP","premium","1234B"
TestDTO.java
import java.io.Serializable;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.univocity.parsers.annotations.Parsed;
import com.univocity.parsers.annotations.Validate;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
@JsonIgnoreProperties(ignoreUnknown = true)
public class TestDTO implements Serializable {
    @Parsed(field = "location_name")
    private String name;
    @Parsed(field = "location_address")
    private String addressLine1;
    @Parsed(field = "location_city")
    private String city;
    @Parsed(field = "location_state_abbreviation")
    private String state;
    @Parsed(field = "location_country_code")
    private String country;
    @Parsed(field = "location_zipcode")
    private String postalCode;
    // @Validate with no arguments rejects null and blank values
    @Parsed(field = "location_latitude")
    @Validate
    private Double latitude;
    @Parsed(field = "location_longitude")
    @Validate
    private Double longitude;
    @Parsed(field = "network_name")
    private String ssid;
}
Main.java
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.detectFormatAutomatically();
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.setHeaderExtractionEnabled(true);
parserSettings.setSkipEmptyLines(true);
// Only read the columns that TestDTO maps
parserSettings.selectFields("network_name", "location_name", "location_address", "location_zipcode",
        "location_latitude", "location_longitude", "location_city", "location_state_abbreviation", "location_country_code");
// Swallow validation errors so that invalid rows are skipped instead of aborting the parse
parserSettings.setProcessorErrorHandler(new RowProcessorErrorHandler() {
    @Override
    public void handleError(DataProcessingException error, Object[] inputRow, ParsingContext context) {
        //do nothing
    }
});
CsvRoutines parser = new CsvRoutines(parserSettings);
// Forward slashes: "\u" inside a Java string literal starts a unicode escape and does not compile
ResultIterator<TestDTO, ParsingContext> iterator = parser.iterate(TestDTO.class, new FileReader("c:/users/.../test.csv")).iterator();
int i = 0;
while (iterator.hasNext()) {
    TestDTO dto = iterator.next();
    if (dto.getLongitude() == null || dto.getLatitude() == null)
        i++;
}
System.out.println("count==" + i);
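Problem:
I expected the count to be zero, because I added the error handler and do not rethrow the data validation exception, but that doesn't seem to be the case. I thought @Validate would throw an exception whenever it encounters a record whose Latitude or Longitude is null (both columns may also be null in the same record), and that the error handler would then handle it so the record is ignored/skipped.
Basically, I don't want univocity to create and map unnecessary DTO objects on the heap (and run out of memory), because the incoming CSV file may have more than 200k or 300k records with a null Latitude/Longitude.
I even tried adding a custom validator to @Validate, but without success.
Can someone tell me what I'm missing here?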
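Author of the library here. You are doing everything right. This is a bug; I've just opened an issue for it and will fix it today.
The bug appears when you select fields: the reordering of values makes the validation run against something else (in my test, it validated the city instead of the latitude).
In your case, just add the following line of code and it will work fine: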
parserSettings.setColumnReorderingEnabled(false);
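This makes the parser generate rows with nulls where fields were not selected, instead of removing the nulls and reordering the values in the parsed row. It avoids the bug and also makes your program run slightly faster.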
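To make the index mismatch concrete, here is a plain-Java sketch of the two row layouts. It is not univocity's code: the class and the helper names `compactRow` and `nullPaddedRow` are made up for illustration, and the idea that the validation looks up the value by its position in the original header is only an assumption that happens to match the "city instead of latitude" observation above. The header is shortened to the first nine columns of the sample, and the empty latitude/longitude are represented as null.

```java
import java.util.Arrays;
import java.util.List;

public class ReorderingSketch {

    // Reordering enabled (modelled): keep only the selected values, in selection order.
    public static String[] compactRow(List<String> headers, List<String> selected, String[] row) {
        String[] out = new String[selected.size()];
        for (int i = 0; i < selected.size(); i++) {
            out[i] = row[headers.indexOf(selected.get(i))];
        }
        return out;
    }

    // Reordering disabled (modelled): keep the original width, null where a column was not selected.
    public static String[] nullPaddedRow(List<String> headers, List<String> selected, String[] row) {
        String[] out = new String[headers.size()];
        for (int i = 0; i < headers.size(); i++) {
            out[i] = selected.contains(headers.get(i)) ? row[i] : null;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("network_name", "location_name", "location_category",
                "location_address", "location_zipcode", "location_phone_number",
                "location_latitude", "location_longitude", "location_city");
        List<String> selected = Arrays.asList("network_name", "location_name", "location_address",
                "location_zipcode", "location_latitude", "location_longitude", "location_city");
        // Last sample record, with the blank latitude/longitude shown as null
        String[] row = {"test", "tess's Takashiro", "Cafe / Restaurant", "Test 1-10",
                "870-01", "097-55-1808", null, null, "Oita"};

        // Position of latitude in the original header (assumed lookup key for the validation)
        int latitudeIndex = headers.indexOf("location_latitude");

        // In the compacted row that position holds the city, so a null check passes
        System.out.println(compactRow(headers, selected, row)[latitudeIndex]);
        // In the null-padded row that position still holds the (null) latitude
        System.out.println(nullPaddedRow(headers, selected, row)[latitudeIndex]);
    }
}
```

In this model the first line printed is the city value and the second is null, which is why keeping the original column positions lets the null check see the real latitude.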
You will also need to test for null in the iteration bit:
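while (iterator.hasNext()) {
    TestDTO dto = iterator.next();
    if (dto != null) { // dto may come null here due to validation
        // the fields are private, so go through the Lombok-generated getters
        if (dto.getLongitude() == null || dto.getLatitude() == null)
            i++;
    }
}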
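If the null check is needed in several places, it can be pulled into a small helper. The following is only a sketch with a hypothetical name (`forEachNonNull`); it relies on nothing from univocity beyond the behaviour described above, namely that the iterator may hand back null for a row that failed validation. Each element is passed to the consumer as soon as it is read, so nothing is collected in memory.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Consumer;

public class SkipNullsSketch {

    // Passes every non-null element to the action and returns how many nulls were skipped.
    public static <T> int forEachNonNull(Iterator<T> iterator, Consumer<? super T> action) {
        int skipped = 0;
        while (iterator.hasNext()) {
            T item = iterator.next();
            if (item == null) {
                skipped++; // e.g. a row rejected by validation
                continue;
            }
            action.accept(item);
        }
        return skipped;
    }

    public static void main(String[] args) {
        // Stand-in for the bean iterator: two valid entries and two rejected ones
        Iterator<String> beans = Arrays.asList("Basel", null, "Nyon", null).iterator();
        int skipped = forEachNonNull(beans, bean -> System.out.println("processing " + bean));
        System.out.println("skipped==" + skipped);
    }
}
```

With the parser from the question, the same helper could be called with the bean iterator and a consumer that handles each TestDTO.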
Hope this helps, and thanks for using our parser!