Remove duplicate rows from a CSV file based on a string - Java

I recently scraped some review data from TripAdvisor and now have a dataset with the following structure.

Organization,Address,Reviewer,Review Title,Review,Review Count,Help Count,Attraction Count,Restaurant Count,Hotel Count,Location,Rating Date,Rating

Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth  is open only twice a day and so it's best to check these timings ...   More,89,48,7,0,0,Vientiane,2 days ago,3

Temple of the Tooth (Sri Dalada Maligawa),Address: Sri Dalada Veediya Kandy 20000 Sri Lanka,WowLao,Temple tour,Visits to places of worship always bring home to me the power of superstition. The Temple of the Tooth was no exception. But I couldn't help but marvel at the fervor with which some devotees were praying. One tip though: the shrine that houses the Tooth  is open only twice a day and so it's best to check these timings  though I would imagine that the crowds would be at a peak.,89,48,7,0,0,Vientiane,2 days ago,3

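As you can see, the first data row holds a partial review, while the second holds the full review.

What I want to do is detect duplicates like these, delete the rows that hold the partial review, and keep the rows that hold the full review.

I noticed that every partial review ends with 'More'. Can that be used somehow to filter out the partial reviews?

How can I solve this with OpenCSV?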

How about the following:

 HashMap<String, String[]> preferredReviews = new HashMap<>();
 int indexOfReview = 4;
 CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
 String[] nextLine;
 while ((nextLine = reader.readNext()) != null) {
     String reviewId = nextLine[0];
     String[] prevReview = preferredReviews.get(reviewId);
     // Keep the row if it is the first one for this key or if its review text is longer.
     if (prevReview == null || prevReview[indexOfReview].length() < nextLine[indexOfReview].length()) {
         preferredReviews.put(reviewId, nextLine);
     }
 }

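The second clause of the if statement compares lengths to decide which row to keep. What I like about this approach is that if, for some reason, there is no full review, you at least keep the short one.

But it could be changed to check for the "... More" ending instead of the review length.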

 HashMap<String, String[]> preferredReviews = new HashMap<>();
 int indexOfReview = 4;
 CSVReader reader = new CSVReader(new FileReader("reviews.csv"));
 String[] nextLine;
 while ((nextLine = reader.readNext()) != null) {
     String reviewId = nextLine[0];
     // Keep only complete reviews: in the sample data a partial one ends with "...", spaces, then "More".
     if (!nextLine[indexOfReview].trim().endsWith("More")) {
         preferredReviews.put(reviewId, nextLine);
     }
 }
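
The two variants above can be combined: prefer a complete review, and fall back to the longer text when both rows are partial. The following is only a sketch under a few assumptions. It works on the `List<String[]>` that `CSVReader.readAll()` returns, so it needs nothing beyond the JDK. The class `ReviewDeduplicator` and its methods are hypothetical names, not part of OpenCSV. It also assumes that organization, reviewer and review title together identify one review, since keying on the organization alone would collapse reviews written by different people.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenCSV.
final class ReviewDeduplicator {
    // Column positions taken from the header row shown in the question.
    static final int ORGANIZATION = 0, REVIEWER = 2, TITLE = 3, REVIEW = 4;

    // A scraped partial review ends with "..." followed by whitespace and "More".
    static boolean isTruncated(String review) {
        return review.trim().matches("(?s).*\\.\\.\\.\\s*More");
    }

    // True when candidate should replace the row currently stored for the same key.
    static boolean isBetter(String[] candidate, String[] current) {
        boolean candidateCut = isTruncated(candidate[REVIEW]);
        boolean currentCut = isTruncated(current[REVIEW]);
        if (candidateCut != currentCut) {
            return currentCut; // a complete review beats a truncated one
        }
        return candidate[REVIEW].length() > current[REVIEW].length();
    }

    static List<String[]> deduplicate(List<String[]> rows) {
        Map<String, String[]> best = new LinkedHashMap<>(); // keeps first-seen order
        for (String[] row : rows) {
            // Assumed composite key; "|" is an arbitrary separator.
            String key = String.join("|", row[ORGANIZATION], row[REVIEWER], row[TITLE]);
            best.merge(key, row,
                    (current, candidate) -> isBetter(candidate, current) ? candidate : current);
        }
        return new ArrayList<>(best.values());
    }
}
```

Calling `ReviewDeduplicator.deduplicate(reader.readAll())` would then return one row per review. If no full version of a review was scraped, the partial row is kept.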

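Note: data from other web services must not be used commercially without explicit permission.

That said: basically, OpenCSV gives you an enumeration of arrays. Each array is one of your rows.

You need to copy your rows into some more semantic data structure. Judging by your header row, I would create a bean like this.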

public class TravelRow {
   String organization;
   String address;
   String reviewer;
   String reviewTitle;
   String review; // you get it... 

   public TravelRow(String[] row) {
       // assign row-index to property
       this.organization = row[0];
       // you get it ...
   }
}

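You will probably want to generate getXXX/setXXX methods for it.

Now you need to find a primary key for the row; I suggest the organization (if one organization can have several reviewers, a combined key such as organization plus reviewer is safer). Iterate over the rows, create a bean for each one, and add it to a hash map keyed by that value.

If the key is already in the hash map, compare the current review with the stored one. If the new review is longer, or the stored review ends with "... More", replace the object in the map.

After iterating over all the rows, you end up with a Map containing the reviews you want: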

Map<String, TravelRow> result = new HashMap<>();
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
   // nextLine[] is an array of values from the line
   if( result.containsKey(nextLine[0]) ) {
       // compare the review
       if( reviewNeedsUpdate(result.get(nextLine[0]), nextLine[4]) ) {
           result.get(nextLine[0]).setReview(nextLine[4]); // update only the review, create a new object, if you like
       }
   }
   else {
       // create TravelRow with array using the constructor eating the line
       result.put(nextLine[0], new TravelRow(nextLine));
   }
}

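reviewNeedsUpdate(TravelRow row, String review) compares review with row.review and returns true if the new review is better. You can extend this function until it fits your needs.

private boolean reviewNeedsUpdate( TravelRow row, String review ) {
    // The marker in the data is capitalized ("More"), and endsWith is case-sensitive.
    return ( row.review.endsWith("More") && !review.endsWith("More") );
}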

Say you define a class Rating to store the relevant data.

class Rating {
  public String review;  // consider using getters/setters instead of public fields

  Rating(String review) {
    this.review = review;
  }
}

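Read the contents of the CSV.

// readAll() declares IOException; newer OpenCSV releases also declare CsvException.
Set<Rating> readCSV() throws Exception {
  List<String[]> csv;
  try (CSVReader reader = new CSVReader(new FileReader("reviews.csv"))) {
    csv = reader.readAll(); // the reader is closed automatically
  }
  List<Rating> ratings = csv.stream()
      .map(row -> new Rating(row[4])) // add the other attributes
      .collect(Collectors.toList());
  return mergeRatings(ratings);
}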

We will use a TreeSet to weed out the duplicates. This requires a custom comparator that discards items already in the set.

class RatingMergerComparator implements Comparator<Rating> {

  @Override
  public int compare(Rating rating1, Rating rating2) {
    if (rating1.review.startsWith(rating2.review) ||
        rating2.review.startsWith(rating1.review)) { 
      return 0;
    }
    return rating1.review.compareTo(rating2.review);
  }
}

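Create the mergeRatings method.

void removeMoreEndings(List<Rating> ratings) {
  String moreEnding = "...   More"; // three dots, three spaces, "More", as in the sample data
  for (Rating rating : ratings) {
    if (rating.review.endsWith(moreEnding)) {
      // cut the marker off by its real length rather than a hard-coded number
      rating.review = rating.review.substring(0, rating.review.length() - moreEnding.length());
    }
  }
}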

Set<Rating> mergeRatings(List<Rating> ratings) {
  removeMoreEndings(ratings); // remove all "...  More" endings
  // sort ratings by length in a descending order, since the set will discard certain items,
  // it is important to keep the longer ones, so they come first
  ratings.sort(Comparator.comparing((Rating rating) -> rating.review.length()).reversed());
  TreeSet<Rating> mergedRatings = new TreeSet<>(new RatingMergerComparator());
  mergedRatings.addAll(ratings);
  return mergedRatings;
}
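
One caveat with the comparator above: returning 0 for any two reviews where one is a prefix of the other does not, as far as I can tell, give a consistent ordering, so what the TreeSet keeps can depend on the order in which it compares items. As an alternative sketch (a swapped-in technique, not the approach above), the same prefix idea can be written with two plain loops. `PrefixMerger` and its methods are hypothetical names, and the code works on the review strings only. It is quadratic in the number of reviews, which should be acceptable for a scraped file of moderate size.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: keeps a review only if no longer kept review starts with it.
final class PrefixMerger {
    // Optional whitespace, "...", optional whitespace, "More" at the end of the text.
    private static final String MORE_SUFFIX_REGEX = "\\s*\\.\\.\\.\\s*More\\s*$";

    static String stripMore(String review) {
        return review.replaceFirst(MORE_SUFFIX_REGEX, "");
    }

    static List<String> merge(List<String> reviews) {
        List<String> cleaned = new ArrayList<>();
        for (String review : reviews) {
            cleaned.add(stripMore(review));
        }
        // Longest first, so a full review is always seen before its truncated copy.
        cleaned.sort(Comparator.comparingInt(String::length).reversed());

        List<String> kept = new ArrayList<>();
        for (String candidate : cleaned) {
            boolean covered = false;
            for (String longer : kept) {
                if (longer.startsWith(candidate)) {
                    covered = true; // candidate is a prefix of a review we already keep
                    break;
                }
            }
            if (!covered) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Note that a partial review with no full counterpart is kept, with its "... More" ending removed.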

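Update

I may have misread the OP. The solution above performs well even when the records that have to be merged are far apart in the CSV. If you are sure that the partial and full reviews are consecutive, it may be a bit of overkill.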

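It depends on how you read the data.

If you use a MappingStrategy to read the data in as beans, you can create your own filter with the CsvToBeanFilter interface and inject it into the CsvToBean class. A line is then read (allowed) or skipped according to the condition in the allowLine method. The Javadoc for CsvToBeanFilter gives a good example. In your case you would allow every line whose Review column does not end with "More".

If you are just using CSVReader/CSVParser, it is a little trickier. You need to read the header and find out which column holds the review. Then, as you read each line, look at the element at that index, and if it ends with "More", do not process it.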
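
The header-lookup approach for the plain CSVReader case might look roughly like this. It is a sketch that assumes the rows have already been read into a `List<String[]>` (for example via `readAll()`), with the header as the first element. `PartialReviewFilter` and `dropPartialReviews` are hypothetical names, not OpenCSV API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: drops every row whose review column ends with "More".
final class PartialReviewFilter {
    static List<String[]> dropPartialReviews(List<String[]> rows, String columnName) {
        List<String[]> kept = new ArrayList<>();
        if (rows.isEmpty()) {
            return kept;
        }
        // Locate the review column by name instead of hard-coding its index.
        String[] header = rows.get(0);
        int column = Arrays.asList(header).indexOf(columnName);
        if (column < 0) {
            throw new IllegalArgumentException("No column named " + columnName);
        }
        kept.add(header);
        for (String[] row : rows.subList(1, rows.size())) {
            boolean partial = row.length > column && row[column].trim().endsWith("More");
            if (!partial) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

Unlike the map-based answers, this drops a partial review even when no full version of it exists in the file.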