Assign unique value to field in duplicate records group during groupingBy

Question

Based on the answer provided by devReddit, I grouped the CSV records (same client names) of the following test file (fake data):

CSV test file

id,name,mother,birth,center
1,Antonio Carlos da Silva,Ana da Silva, 2008/03/31,1
2,Carlos Roberto de Souza,Amália Maria de Souza,2004/12/10,1
3,Pedro de Albuquerque,Maria de Albuquerque,2006/04/03,2
4,Danilo da Silva Cardoso,Sônia de Paula Cardoso,2002/08/10,3
5,Ralfo dos Santos Filho,Helena dos Santos,2012/02/21,4
6,Pedro de Albuquerque,Maria de Albuquerque,2006/04/03,2
7,Antonio Carlos da Silva,Ana da Silva, 2008/03/31,1
8,Ralfo dos Santos Filho,Helena dos Santos,2012/02/21,4
9,Rosana Pereira de Campos,Ivana Maria de Campos,2002/07/16,3
10,Paula Cristina de Abreu,Cristina Pereira de Abreu,2014/10/25,2
11,Pedro de Albuquerque,Maria de Albuquerque,2006/04/03,2
12,Ralfo dos Santos Filho,Helena dos Santos,2012/02/21,4

Client entity

package entities;

public class Client {

    private String id;
    private String name;
    private String mother;
    private String birth;
    private String center;
    
    public Client() {
    }

    public Client(String id, String name, String mother, String birth, String center) {
        this.id = id;
        this.name = name;
        this.mother = mother;
        this.birth = birth;
        this.center = center;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getMother() {
        return mother;
    }

    public void setMother(String mother) {
        this.mother = mother;
    }

    public String getBirth() {
        return birth;
    }

    public void setBirth(String birth) {
        this.birth = birth;
    }

    public String getCenter() {
        return center;
    }

    public void setCenter(String center) {
        this.center = center;
    }
        
    @Override
    public String toString() {
        return "Client [id=" + id + ", name=" + name + ", mother=" + mother + ", birth=" + birth + ", center=" + center
                + "]";
    }
        
}

Program

package application;
    
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
    
import entities.Client;
    
public class Program {
    
    public static void main(String[] args) throws IOException {
            
        Pattern pattern = Pattern.compile(",");
            
        List<Client> file = Files.lines(Paths.get("src/Client.csv"))  
            .skip(1)
            .map(line -> { 
                String[] fields = pattern.split(line);
                return new Client(fields[0], fields[1], fields[2], fields[3], fields[4]);
            })
            .collect(Collectors.toList()); 
                        
        Map<String, List<Client>> grouped = file
            .stream()
            .filter(x -> file.stream().anyMatch(y -> isDuplicate(x, y)))
            .collect(Collectors.toList())
            .stream()
            .collect(Collectors.groupingBy(p -> p.getCenter(), LinkedHashMap::new, Collectors.mapping(Function.identity(), Collectors.toList())));

        grouped.entrySet().forEach(System.out::println);    
    }

    // Two records are duplicates when they differ in id but share name, mother and birth.
    private static boolean isDuplicate(Client x, Client y) {
        return !x.getId().equals(y.getId())
            && x.getName().equals(y.getName())
            && x.getMother().equals(y.getMother())
            && x.getBirth().equals(y.getBirth());
    }
}

Final result (grouped by center)

1=[Client [id=1, name=Antonio Carlos da Silva, mother=Ana da Silva, birth= 2008/03/31, center=1],
    Client [id=7, name=Antonio Carlos da Silva, mother=Ana da Silva, birth= 2008/03/31, center=1]]
2=[Client [id=3, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
    Client [id=5, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2],
    Client [id=6, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
    Client [id=8, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2],
    Client [id=11, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
    Client [id=12, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2]]

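What I need

I need to assign a unique value to each group of duplicate records, restarting whenever the center value changes, while still keeping the records of a group together (a Map does not guarantee this), as in the following example:

The number on the left shows the grouping by center (1 and 2). Duplicate names share the same inner group number, which starts at "1". When the center number changes, the inner group number must restart from "1", and so on.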

    1=[Client [group=1, id=1, name=Antonio Carlos da Silva, mother=Ana da Silva, birth= 2008/03/31, center=1],
       Client [group=1, id=7, name=Antonio Carlos da Silva, mother=Ana da Silva, birth= 2008/03/31, center=1]]

 // CENTER CHANGED (2) - Restart inner group number to "1" again.

    2=[Client [group=1, id=3, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
       Client [group=1, id=6, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
       Client [group=1, id=11, name=Pedro de Albuquerque, mother=Maria de Albuquerque, birth=2006/04/03, center=2],
 
// NAME CHANGED, BUT SAME CENTER YET - so increases by "1" (group=2)
      
       Client [group=2, id=5, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2],
       Client [group=2, id=8, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2],
       Client [group=2, id=12, name=Ralfo dos Santos Filho, mother=Helena dos Santos, birth=2012/02/21, center=2]]

Answer 1

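If I understand correctly, you need to sort the already grouped entries by all three attributes: name, mother and birth. You can apply that ordering before collecting with groupingBy, using sorted: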

 Map<String, List<Client>> grouped = file.stream()
                    .filter(x -> file.stream().anyMatch(y -> isDuplicate(x, y)))
                    .sorted(Comparator.comparing(Client::getName)
                                      .thenComparing(Client::getMother)
                                      .thenComparing(Client::getBirth))
                    .collect(Collectors.groupingBy(Client::getCenter));

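Collectors.groupingBy uses Collectors.toList() as its default downstream, so each list keeps the encounter order you already established with sorted; a LinkedHashMap is not needed for that. Note that the order of the center keys themselves is not guaranteed by the default HashMap, so keep LinkedHashMap::new (or use TreeMap::new) if that order matters.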
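To illustrate the ordering point, here is a minimal sketch under a few assumptions: Row, idsByCenter and sample are hypothetical names, the data is a trimmed-down version of the test file, and TreeMap::new is used only to make the center keys print in sorted order.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SortedGroupingSketch {

    // Hypothetical, trimmed-down stand-in for the Client entity.
    static final class Row {
        final String id, name, center;

        Row(String id, String name, String center) {
            this.id = id;
            this.name = name;
            this.center = center;
        }

        String getId() { return id; }
        String getName() { return name; }
        String getCenter() { return center; }
    }

    // Sorts by name first, then groups by center and keeps only the ids.
    // Each list keeps the sorted encounter order; TreeMap orders the keys.
    static Map<String, List<String>> idsByCenter(List<Row> rows) {
        return rows.stream()
                .sorted(Comparator.comparing(Row::getName))
                .collect(Collectors.groupingBy(Row::getCenter, TreeMap::new,
                        Collectors.mapping(Row::getId, Collectors.toList())));
    }

    static List<Row> sample() {
        return Arrays.asList(
                new Row("5", "Ralfo dos Santos Filho", "2"),
                new Row("3", "Pedro de Albuquerque", "2"),
                new Row("1", "Antonio Carlos da Silva", "1"),
                new Row("8", "Ralfo dos Santos Filho", "2"),
                new Row("6", "Pedro de Albuquerque", "2"));
    }

    public static void main(String[] args) {
        // Records with the same name end up next to each other inside each center.
        System.out.println(idsByCenter(sample()));
    }
}
```

Because sorted is stable on an ordered stream, records with equal names keep their original relative order (id 3 before id 6 here).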

Update: To generate a groupId, you can generate it from the Client entity. Here is the updated Client:

package com.example.demo;

import java.util.Optional;

public class Client {

    private String id;
    private String name;
    private String mother;
    private String birth;
    private String center;
    private String groupId;

    public Client() {
    }

    public Client(String id, String name, String mother, String birth, String center) {
        this.id = id;
        this.name = name;
        this.mother = mother;
        this.birth = birth;
        this.center = center;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getMother() {
        return mother;
    }

    public void setMother(String mother) {
        this.mother = mother;
    }

    public String getBirth() {
        return birth;
    }

    public void setBirth(String birth) {
        this.birth = birth;
    }

    public String getCenter() {
        return center;
    }

    public void setCenter(String center) {
        this.center = center;
    }

    public Optional<String> getGroupId() {
        return Optional.ofNullable(groupId);
    }

    public void setGroupId(final String groupId) {
        this.groupId = groupId;
    }

    @Override
    public String toString() {
        return getGroupId().isPresent()
                ? "Client [groupId=" + groupId + ", id=" + id + ", name=" + name + ", mother=" + mother + ", birth=" + birth +
                ", center=" + center + "]"
                : "Client [id=" + id + ", name=" + name + ", mother=" + mother + ", birth=" + birth + ", center=" + center + "]";
    }
    
    ///
    /// Other public methods
    ///

    public Client generateAndAssignGroupId() {
        setGroupId(String.format("**group=%s**", center));
        return this;
    }
}

And the new stream:

Map<String, List<Client>> grouped = file.stream()
                .filter(x -> file.stream().anyMatch(y -> isDuplicate(x, y)))
                .sorted(Comparator.comparing(Client::getName).thenComparing(Client::getMother).thenComparing(Client::getBirth))
                .collect(Collectors.groupingBy(Client::getCenter,
                        Collectors.mapping(Client::generateAndAssignGroupId, Collectors.toList())));
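Note that generateAndAssignGroupId derives its value from center alone, so every record of a center receives the same value. If the number should restart at 1 for each center and increase for each new set of duplicates, one possible approach is sketched below. Rec, key, numberDuplicates and sample are hypothetical names, and the sketch assumes duplicates are only compared within the same center and that records without a duplicate are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupNumberingSketch {

    // Hypothetical, trimmed-down stand-in for the Client entity.
    static final class Rec {
        final String id, name, mother, birth, center;
        int group; // stays 0 until a group number is assigned

        Rec(String id, String name, String mother, String birth, String center) {
            this.id = id;
            this.name = name;
            this.mother = mother;
            this.birth = birth;
            this.center = center;
        }

        String getCenter() { return center; }

        // Fields that define a duplicate; id is deliberately left out.
        String key() { return name + "~" + mother + "~" + birth; }

        @Override
        public String toString() {
            return "[group=" + group + ", id=" + id + ", center=" + center + "]";
        }
    }

    static Map<String, List<Rec>> numberDuplicates(List<Rec> rows) {
        // Step 1: group by center, keeping the order in which centers first appear.
        Map<String, List<Rec>> byCenter = rows.stream()
                .collect(Collectors.groupingBy(Rec::getCenter, LinkedHashMap::new, Collectors.toList()));

        Map<String, List<Rec>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Rec>> entry : byCenter.entrySet()) {
            // Step 2: inside one center, bucket records by their duplicate key.
            Map<String, List<Rec>> byKey = entry.getValue().stream()
                    .collect(Collectors.groupingBy(Rec::key, LinkedHashMap::new, Collectors.toList()));

            // Step 3: number each bucket with 2+ records; the counter restarts per center.
            List<Rec> numbered = new ArrayList<>();
            int next = 1;
            for (List<Rec> same : byKey.values()) {
                if (same.size() < 2) {
                    continue; // a single record is not a duplicate
                }
                for (Rec r : same) {
                    r.group = next;
                }
                numbered.addAll(same); // keeps duplicates next to each other
                next++;
            }
            if (!numbered.isEmpty()) {
                result.put(entry.getKey(), numbered);
            }
        }
        return result;
    }

    static List<Rec> sample() {
        return Arrays.asList(
                new Rec("1", "Antonio Carlos da Silva", "Ana da Silva", "2008/03/31", "1"),
                new Rec("2", "Carlos Roberto de Souza", "Amalia Maria de Souza", "2004/12/10", "1"),
                new Rec("3", "Pedro de Albuquerque", "Maria de Albuquerque", "2006/04/03", "2"),
                new Rec("5", "Ralfo dos Santos Filho", "Helena dos Santos", "2012/02/21", "2"),
                new Rec("6", "Pedro de Albuquerque", "Maria de Albuquerque", "2006/04/03", "2"),
                new Rec("7", "Antonio Carlos da Silva", "Ana da Silva", "2008/03/31", "1"),
                new Rec("8", "Ralfo dos Santos Filho", "Helena dos Santos", "2012/02/21", "2"));
    }

    public static void main(String[] args) {
        numberDuplicates(sample())
                .forEach((center, list) -> System.out.println(center + "=" + list));
    }
}
```

The numbering is done in a plain loop rather than inside the collector, which keeps the per-center counter a simple local variable instead of shared mutable state in a lambda.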

Answer 2

Instead of calling file.stream() inside the filter for every element, you can build a map by forming a key from the relevant fields:

New method in the Client class

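// The key must not contain id: duplicates have different ids,
// so including it would make every key unique.
public String getKey() {
    return String.format("%s~%s~%s", name, mother, birth);
}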

Use it to create a map with the count as the value.

Map<String, Long> countMap = 
    file.stream()
        .map(Client::getKey)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
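As a small illustration of how the count map replaces the nested file.stream() scan, the sketch below works on plain key strings. It assumes each key is built only from the fields that define a duplicate (name, mother, birth); countKeys and keepDuplicates are hypothetical helper names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class DuplicateCountSketch {

    // One pass over the data: key -> number of occurrences.
    static Map<String, Long> countKeys(List<String> keys) {
        return keys.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Second pass: keep only the keys seen more than once, in their original order.
    static List<String> keepDuplicates(List<String> keys) {
        Map<String, Long> counts = countKeys(keys);
        return keys.stream()
                .filter(k -> counts.get(k) > 1L)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "Pedro de Albuquerque~Maria de Albuquerque~2006/04/03",
                "Danilo da Silva Cardoso~Sonia de Paula Cardoso~2002/08/10",
                "Pedro de Albuquerque~Maria de Albuquerque~2006/04/03");
        System.out.println(countKeys(keys));
        System.out.println(keepDuplicates(keys));
    }
}
```

This makes two linear passes with constant-time map lookups, whereas the anyMatch version compares every record against every other record.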

Then

// For each inner group you need a separate id based on the name.
// The input would be a map with client name as the key and the
// value would be the corresponding list of clients.
// The below function returns a new map with 
// integer as the key part (required unique id for each inner group).
Function<Map<String, List<Client>>, Map<Integer, List<Client>>> mapper
    = map -> {
        AtomicInteger i = new AtomicInteger(1);
        return map.entrySet().stream()
                  .collect(Collectors.toMap(e -> i.getAndIncrement(), Map.Entry::getValue));
    };

// assuming static imports of groupingBy, collectingAndThen and toList from java.util.stream.Collectors
Map<String, Map<Integer, List<Client>>> grouped = 
    file.stream()
        .filter(x -> countMap.get(x.getKey()) > 1L) // indicates duplicate
        .collect(groupingBy(Client::getCenter,    
                            collectingAndThen(groupingBy(Client::getName, toList()),
                                              mapper /* the above function*/ )));
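One caveat with the mapper above: the inner groupingBy collects into a HashMap by default, so which name receives which number depends on hash order. A hedged variant that numbers the names by first appearance might look like the following sketch. Rec, numberByName, group and sample are hypothetical names, and the input is assumed to be already filtered down to duplicates.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class InnerGroupIdSketch {

    // Hypothetical, trimmed-down stand-in for the Client entity.
    static final class Rec {
        final String id, name, center;

        Rec(String id, String name, String center) {
            this.id = id;
            this.name = name;
            this.center = center;
        }

        String getName() { return name; }
        String getCenter() { return center; }

        @Override
        public String toString() { return "id=" + id; }
    }

    // Numbers the name groups of ONE center as 1, 2, 3, ... by first appearance.
    static Map<Integer, List<Rec>> numberByName(List<Rec> sameCenter) {
        Map<String, List<Rec>> byName = sameCenter.stream()
                .collect(Collectors.groupingBy(Rec::getName, LinkedHashMap::new, Collectors.toList()));
        Map<Integer, List<Rec>> numbered = new LinkedHashMap<>();
        int next = 1;
        for (List<Rec> sameName : byName.values()) {
            numbered.put(next++, sameName);
        }
        return numbered;
    }

    // center -> (inner group number -> records).
    // TreeMap sorts center keys as strings, so "10" would sort before "2".
    static Map<String, Map<Integer, List<Rec>>> group(List<Rec> duplicates) {
        Function<List<Rec>, Map<Integer, List<Rec>>> finisher = InnerGroupIdSketch::numberByName;
        Collector<Rec, ?, Map<Integer, List<Rec>>> perCenter =
                Collectors.collectingAndThen(Collectors.toList(), finisher);
        return duplicates.stream()
                .collect(Collectors.groupingBy(Rec::getCenter, TreeMap::new, perCenter));
    }

    static List<Rec> sample() {
        return Arrays.asList(
                new Rec("3", "Pedro de Albuquerque", "2"),
                new Rec("5", "Ralfo dos Santos Filho", "2"),
                new Rec("1", "Antonio Carlos da Silva", "1"),
                new Rec("6", "Pedro de Albuquerque", "2"),
                new Rec("7", "Antonio Carlos da Silva", "1"),
                new Rec("8", "Ralfo dos Santos Filho", "2"));
    }

    public static void main(String[] args) {
        System.out.println(group(sample()));
    }
}
```

The counter lives inside numberByName, which is called once per center, so the numbering restarts at 1 for each center without needing an AtomicInteger.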

Answer 3

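The task is to group the CSV file by center and, within each group, sort the names in ascending order. The code would be quite long if you tried to do this in Java.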

It is easy to do with SPL, an open-source Java package. One line of code is enough:

	A
1	=file("client.csv":"UTF-8").import@ct().sort(center,name).derive(ranki(name;center):group)

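SPL provides a JDBC driver that can be invoked from Java. Just store the SPL script above as dense_rank.splx and invoke it from Java the same way you would call a stored procedure: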

…
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
st=con.prepareCall("call dense_rank ()");
st.execute();
…

Or execute the SPL string in a Java program the same way you would execute an SQL statement:

…
st = con.prepareStatement("==file(\"client.csv\":\"UTF-8\")"
     + ".import@ct().sort(center,name).derive(ranki(name;center):group)");
st.execute();
…
