Remove duplicate rows from a CSV file based on a string - Java
I recently scraped some review data from TripAdvisor and now have a dataset with the following structure.
Organization,Address,Reviewer,Review Title,Review,Review Count,Help Count,Attraction Count,Restaurant Count,Hotel Count,Location,Rating Date,Rating
Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth is open only twice a day and so it's best to check these timings ... More,89,48,7,0,0,Vientiane,2 days ago,3
Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth is open only twice a day and so it's best to check these timings though I would imagine that the crowds would be at a peak.,89,48,7,0,0,Vientiane,2 days ago,3
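As you can see, the first row contains a partial review, while the second row contains the full review.
What I want is to detect duplicates like this, remove the row with the partial review, and keep the row with the full review.
I noticed that every partial review ends with 'More'. Can that be used somehow to filter out the partial reviews?
How can I solve this using OpenCSV?
How about the following: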
HashMap<String, String[]> preferredReviews = new HashMap<>();
int indexOfReview = 4;
CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    String reviewId = nextLine[0];
    String[] prevReview = preferredReviews.get(reviewId);
    // keep the row if it is the first one for this id or if its review text is longer
    if (prevReview == null || prevReview[indexOfReview].length() < nextLine[indexOfReview].length()) {
        preferredReviews.put(reviewId, nextLine);
    }
}
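The second clause of the if statement compares lengths to decide which row to keep. What I like about this approach is that if for some reason there is no full review, you at least get the short one.
But it could be changed to check for "... More" instead of comparing review lengths.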
HashMap<String, String[]> preferredReviews = new HashMap<>();
int indexOfReview = 4;
CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    String reviewId = nextLine[0];
    // keep only rows whose review is NOT truncated
    if (!nextLine[indexOfReview].endsWith("... More")) {
        preferredReviews.put(reviewId, nextLine);
    }
}
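The two checks can also be combined: prefer a complete review over a truncated one, and fall back to the length comparison only when both have the same status. The following is a minimal sketch under that assumption; `ReviewChoice`, `isTruncated`, and `isBetter` are hypothetical names for illustration and are not part of OpenCSV.

```java
final class ReviewChoice {
    // Marker appended to shortened reviews, taken from the sample data.
    static final String MARKER = "... More";

    static boolean isTruncated(String review) {
        return review.trim().endsWith(MARKER);
    }

    // Returns true when candidate should replace stored (stored may be null).
    static boolean isBetter(String stored, String candidate) {
        if (stored == null) {
            return true;
        }
        boolean storedCut = isTruncated(stored);
        boolean candidateCut = isTruncated(candidate);
        if (storedCut != candidateCut) {
            return storedCut; // a complete review beats a truncated one
        }
        return candidate.length() > stored.length(); // same status: keep the longer text
    }

    public static void main(String[] args) {
        System.out.println(isBetter("Nice temple ... More", "Nice temple, worth a visit."));
    }
}
```

With this helper, the condition in the loop would compare the stored review text (or null when nothing is stored yet) against `nextLine[indexOfReview]`.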
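Note: data from other web services must not be used commercially without explicit permission.
That said:
Basically, openCSV gives you an enumeration of arrays. Each array is one of your rows.
You need to copy your rows into some other, more semantic data structure. Judging from your header row, I would create a bean like this.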
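public class TravelRow {
    String organization;
    String address;
    String reviewer;
    String reviewTitle;
    String review; // you get it...

    public TravelRow(String[] row) {
        // assign row-index to property, following the header order
        this.organization = row[0];
        this.address = row[1];
        this.reviewer = row[2];
        this.reviewTitle = row[3];
        this.review = row[4];
        // you get it ...
    }

    public String getReview() { return review; }

    public void setReview(String review) { this.review = review; }
}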
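You will probably want to generate getXXX and setXXX methods for the remaining fields.
Now you need a primary key for the row; I suggest organization.
Iterate over the rows, create a bean for each one, and add it to a hash map keyed by organization.
If the organization is already in the hash map, compare the current review with the stored one. If the new review is longer, or the stored review ends with "... More", replace the object in the map.
After iterating over all rows, you have a Map containing the reviews you want.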
Map<String, TravelRow> result = new HashMap<>();
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    // nextLine[] is an array of values from the line
    if (result.containsKey(nextLine[0])) {
        // compare the review
        if (reviewNeedsUpdate(result.get(nextLine[0]), nextLine[4])) {
            result.get(nextLine[0]).setReview(nextLine[4]); // update only the review, create a new object, if you like
        }
    } else {
        // create TravelRow with array using the constructor eating the line
        result.put(nextLine[0], new TravelRow(nextLine));
    }
}
reviewNeedsUpdate(TravelRow row, String review) compares review with row.review and returns true if the new review is better. You can extend this function until it fits your needs.
private boolean reviewNeedsUpdate(TravelRow row, String review) {
    // endsWith is case-sensitive; the truncation marker in the data is "... More"
    return row.review.endsWith("More") && !review.endsWith("More");
}
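One caveat with keying on the organization alone: an attraction normally has many reviews, so that key would collapse reviews written by different people into a single entry. The sketch below keys on organization, reviewer, and review title instead, assuming that this combination identifies one review (an assumption about the data, not something OpenCSV provides). It works on rows that have already been read as String[]; `ReviewDeduper`, `dedupe`, and `needsUpdate` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReviewDeduper {
    // Column positions taken from the header row in the question.
    static final int ORGANIZATION = 0, REVIEWER = 2, TITLE = 3, REVIEW = 4;

    static List<String[]> dedupe(List<String[]> rows) {
        Map<String, String[]> best = new LinkedHashMap<>(); // keeps first-seen order
        for (String[] row : rows) {
            // Assumed composite key: one reviewer writes one review per title and place.
            String key = row[ORGANIZATION] + "\t" + row[REVIEWER] + "\t" + row[TITLE];
            String[] stored = best.get(key);
            if (stored == null || needsUpdate(stored[REVIEW], row[REVIEW])) {
                best.put(key, row);
            }
        }
        return new ArrayList<>(best.values());
    }

    // Replace a truncated stored review with a complete one.
    static boolean needsUpdate(String stored, String candidate) {
        return stored.endsWith("More") && !candidate.endsWith("More");
    }
}
```

If no complete version of a review exists, the truncated row is kept rather than dropped.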
Say you define a class Rating to store the relevant data.
class Rating {
    public String review; // consider using getters/setters instead of public fields

    Rating(String review) {
        this.review = review;
    }
}
Read the contents of the CSV.
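Set<Rating> readCSV() throws Exception { // FileReader and readAll() throw checked exceptions
    // try-with-resources closes the reader when done
    try (CSVReader reader = new CSVReader(new FileReader("reviews.csv"))) {
        List<String[]> csv = reader.readAll();
        List<Rating> ratings = csv.stream()
                .map(row -> new Rating(row[4])) // add the other attributes
                .collect(Collectors.toList());
        return mergeRatings(ratings);
    }
}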
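We will use a TreeSet to weed out the duplicates. This needs a custom comparator that treats a partial review and its full version as equal, so the set discards an item when a matching one is already in it.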
class RatingMergerComparator implements Comparator<Rating> {
    @Override
    public int compare(Rating rating1, Rating rating2) {
        // one review being a prefix of the other counts as "the same review"
        if (rating1.review.startsWith(rating2.review) ||
                rating2.review.startsWith(rating1.review)) {
            return 0;
        }
        return rating1.review.compareTo(rating2.review);
    }
}
Create the mergeRatings method.
void removeMoreEndings(List<Rating> ratings) {
    String marker = "... More";
    for (Rating rating : ratings) {
        if (rating.review.endsWith(marker)) {
            // cut off the marker, then trim the space that preceded it
            rating.review = rating.review.substring(0, rating.review.length() - marker.length()).trim();
        }
    }
}
Set<Rating> mergeRatings(List<Rating> ratings) {
    removeMoreEndings(ratings); // remove all "... More" endings
    // sort ratings by length in a descending order, since the set will discard certain items,
    // it is important to keep the longer ones, so they come first
    ratings.sort(Comparator.comparing((Rating rating) -> rating.review.length()).reversed());
    TreeSet<Rating> mergedRatings = new TreeSet<>(new RatingMergerComparator());
    mergedRatings.addAll(ratings);
    return mergedRatings;
}
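To see why the descending sort matters, here is a small self-contained sketch of the same idea on plain strings rather than Rating objects. The names `PrefixMergeDemo` and `merge` are hypothetical, and the input is assumed to have its "... More" endings already removed. Note that a comparator returning 0 for prefixes is not a consistent ordering in the strict sense, so treat this as an illustration, not a guarantee for arbitrary data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class PrefixMergeDemo {
    static Set<String> merge(List<String> reviews) {
        List<String> sorted = new ArrayList<>(reviews);
        // Longest first, so the full review enters the set before its partial copy.
        Comparator<String> longestFirst = (a, b) -> Integer.compare(b.length(), a.length());
        sorted.sort(longestFirst);

        // A review that is a prefix of another one counts as equal to it.
        Comparator<String> prefixAsEqual = (a, b) ->
                (a.startsWith(b) || b.startsWith(a)) ? 0 : a.compareTo(b);
        Set<String> merged = new TreeSet<>(prefixAsEqual);
        merged.addAll(sorted); // the set ignores elements that compare equal to an existing one
        return merged;
    }

    public static void main(String[] args) {
        List<String> reviews = new ArrayList<>();
        reviews.add("The temple is");
        reviews.add("The temple is worth a visit.");
        reviews.add("Too crowded.");
        System.out.println(merge(reviews));
    }
}
```

Because the set keeps whichever element arrives first, adding the shorter text first would keep the partial review and discard the full one.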
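Update
I may have misread the OP. The solution above performs very well even when the records that have to be merged are far apart in the CSV. If you are sure that the partial and full reviews are always consecutive, it may be overkill.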
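For the consecutive case, a single pass that only looks at the previously kept row is enough. This is a sketch under that assumption; `AdjacentMerge` and its methods are hypothetical names, rows are String[] arrays such as CSVReader.readNext() returns, and two rows are only merged when organization and reviewer match.

```java
import java.util.ArrayList;
import java.util.List;

final class AdjacentMerge {
    static final int ORGANIZATION = 0, REVIEWER = 2, REVIEW = 4;
    static final String MARKER = "... More";

    // Removes the truncation marker (if present) and surrounding whitespace.
    static String stripMarker(String review) {
        String r = review.trim();
        return r.endsWith(MARKER) ? r.substring(0, r.length() - MARKER.length()).trim() : r;
    }

    static List<String[]> merge(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            if (!out.isEmpty()) {
                String[] prev = out.get(out.size() - 1);
                boolean sameAuthor = prev[ORGANIZATION].equals(row[ORGANIZATION])
                        && prev[REVIEWER].equals(row[REVIEWER]);
                String a = stripMarker(prev[REVIEW]);
                String b = stripMarker(row[REVIEW]);
                if (sameAuthor && a.startsWith(b)) {
                    continue; // current row is the shorter copy: skip it
                }
                if (sameAuthor && b.startsWith(a)) {
                    out.set(out.size() - 1, row); // current row is the fuller copy: replace
                    continue;
                }
            }
            out.add(row);
        }
        return out;
    }
}
```

This only needs the last kept row in memory, so it also works when the file is processed line by line instead of being loaded as a list.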
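It depends on how you read the data.
If you are using a MappingStrategy to read the data as beans, you can implement the CsvToBeanFilter interface to create your own filter and inject it into the CsvToBean class. Each line is then read (allowed) or skipped depending on the condition in the allowLine method. The Javadoc for the filter interface gives a good example. In your case you would allow every line whose Review column does not end with "More".
If you are only using CSVReader/CSVParser, it is a little trickier. You will need to read the header and find out which column holds the review. Then, as you read each line, look at the element at that index, and if it ends with "More", do not process it.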
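The second approach can be sketched as follows. It assumes the first row is the header and that the column is named "Review", as in the sample data. The rows are plain String[] arrays such as CSVReader.readNext() returns, so the sketch itself does not depend on OpenCSV; `MoreFilter`, `indexOfColumn`, and `dropTruncated` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class MoreFilter {
    // Finds the position of a column by its exact header name (case-insensitive).
    static int indexOfColumn(String[] header, String name) {
        for (int i = 0; i < header.length; i++) {
            if (header[i].trim().equalsIgnoreCase(name)) {
                return i;
            }
        }
        throw new IllegalArgumentException("No column named " + name);
    }

    // Keeps the header and every row whose review does not end with "More".
    static List<String[]> dropTruncated(List<String[]> rowsWithHeader) {
        List<String[]> kept = new ArrayList<>();
        if (rowsWithHeader.isEmpty()) {
            return kept;
        }
        String[] header = rowsWithHeader.get(0);
        int reviewIndex = indexOfColumn(header, "Review");
        kept.add(header);
        for (String[] row : rowsWithHeader.subList(1, rowsWithHeader.size())) {
            if (!row[reviewIndex].trim().endsWith("More")) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

Be aware that this drops a truncated review even when no complete version of it exists, and that it would also drop a genuine review whose last word happens to be "More".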