How can I quickly map every word in the film descriptions to all the films whose description contains that word?

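I have a List<Film> films, where each film has an int id and a String description. My task is to map every word from all the descriptions to all the films whose description contains that word. The result must look like this: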

<word1>: <filmId11>, <filmId12>,..., <filmId1N>
<word2>: <filmId21>, <filmId22>, ..., <filmId2N>
...

I use the Java Stream API:

private List<Map.Entry<String, String>> wordToFilmIds;

private void addWordsFromDescriptions(List<Film> films) {
    for (Film film : films) {
        String description = film.description();
        // Split on runs of punctuation and whitespace.
        String[] tokens = description.split("[\\p{IsPunctuation}\\p{IsWhite_Space}]+");
        allWords.addAll(Arrays.stream(tokens).toList());
    }
}

private void mapWordsToFilmIDs(List<Film> films) {
    // For every word, scan every film again: this is the slow part.
    wordToFilmIds = allWords.stream()
            .map(word -> Map.entry(word,
                    films.stream()
                            .filter(film -> film.description().contains(word))
                            .map(film -> String.valueOf(film.id()))
                            .collect(Collectors.joining(","))))
            .toList();
}

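The problem is that my solution is too slow and I have to handle a lot of data: there are about 12,000 films and the descriptions are not short. Also, I am not allowed to use multi-threading. Any idea how to optimize it? Right now the program does not finish.

I also tried parallel streams, but that did not help either.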

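I think the fact that you iterate over every film for each word is what makes the solution O(n^2). A single pass over the films is enough, though.

Given a helper class: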

public class Tuple<A,B> {
    public A a;
    public B b;
    public Tuple(A a, B b) {
        this.a = a;
        this.b = b;
    }
}

Try this:

    Map<String, Set<Integer>> addWordsFromDescriptions(List<Film> films) {
        return films.stream()
                .flatMap(film -> tokenizeDescription(film).map(token -> new Tuple<>(token, film)))
                .collect(Collectors.groupingBy(
                        tuple -> tuple.a,
                        Collectors.mapping(tuple -> tuple.b.id(), Collectors.toSet())
                ));
    }

    private Stream<String> tokenizeDescription(Film film) {
        return Stream.of(film.description().split("[\\p{IsPunctuation}\\p{IsWhite_Space}]+"));
    }

Given the Map<String, Set<Integer>>, you can join the IDs in each set to get the string you want.
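As a minimal sketch of that last joining step: the class name WordIndexFormatter and the method formatLines are made up for illustration, and the sorting is an assumption added only to make the output order predictable (the question does not require any particular order).

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordIndexFormatter {

    // Turns each map entry into one "<word>: <id1>, <id2>, ..." line.
    static List<String> formatLines(Map<String, Set<Integer>> wordToFilmIds) {
        // TreeMap and sorted() are only here to make the order deterministic.
        return new TreeMap<>(wordToFilmIds).entrySet().stream()
                .map(entry -> entry.getKey() + ": " + entry.getValue().stream()
                        .sorted()
                        .map(String::valueOf)
                        .collect(Collectors.joining(", ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = Map.of(
                "space", Set.of(1, 3),
                "robot", Set.of(2));
        // Prints:
        // robot: 2
        // space: 1, 3
        formatLines(index).forEach(System.out::println);
    }
}
```

If the order of words and IDs does not matter, the TreeMap copy and the sorted() call can be dropped.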

public static class Film {

    private final String id;
    private final String description;

    public Film(String id, String description) {
        this.id = id;
        this.description = description;
    }

    public String getId() {
        return id;
    }

    public String getDescription() {
        return description;
    }

}

public static void main(String... args) {
    List<Film> films = List.of();
    Map<String, Film> filmById = films.stream().collect(Collectors.toMap(Film::getId, Function.identity()));
    Map<String, Set<String>> filmByLowerCaseDescriptionWord = createDescriptionMap(films);
}

private static Map<String, Set<String>> createDescriptionMap(List<Film> films) {
    Map<String, Set<String>> map = new HashMap<>();

    // Split on runs of whitespace, normalise case, and index each word.
    films.forEach(film -> Arrays.stream(film.getDescription().split("\\s+"))
                                .map(word -> word.trim().toLowerCase(Locale.ROOT))
                                .forEach(lowerCaseWord ->
                                        map.computeIfAbsent(lowerCaseWord, key ->
                                                new HashSet<>()).add(film.getId())));

    return map;
}

Sometimes a for-each loop is clearer and more readable than streams.

Java 8 introduced new methods in the Map interface; computeIfAbsent is a good fit for this case:

Map<String, Set<Integer>> wordsToFilmIdMap = new HashMap<>();

for (Film film : films) {
   String[] filmDescArray = film.getDesc().split("[\\p{IsPunctuation}\\p{IsWhite_Space}]+");
   for (String descWord : filmDescArray) {
        wordsToFilmIdMap.computeIfAbsent(descWord, unused -> new HashSet<>()).add(film.getId());
   }
}
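
For reference, here is a self-contained sketch of the same loop that can be run as-is. The Film record is a hypothetical stand-in for the type in the question (records need Java 16 or later), and the class name WordIndexDemo, the method buildIndex, the precompiled pattern and the empty-token check are additions for illustration, not part of the answer above.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

public class WordIndexDemo {

    // Hypothetical stand-in for the Film type from the question.
    record Film(int id, String description) {}

    // Compiled once instead of on every split call.
    private static final Pattern SEPARATORS =
            Pattern.compile("[\\p{IsPunctuation}\\p{IsWhite_Space}]+");

    // One pass over the films: each token is looked up in the map exactly once.
    static Map<String, Set<Integer>> buildIndex(List<Film> films) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (Film film : films) {
            for (String word : SEPARATORS.split(film.description())) {
                if (word.isEmpty()) {
                    continue; // split yields a leading "" when the text starts with a separator
                }
                index.computeIfAbsent(word, unused -> new LinkedHashSet<>()).add(film.id());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<Film> films = List.of(
                new Film(1, "A robot learns to love."),
                new Film(2, "Love, loss and a robot uprising."));
        Map<String, Set<Integer>> index = buildIndex(films);
        System.out.println(index.get("robot")); // [1, 2]
        System.out.println(index.get("Love"));  // [2]
    }
}
```

Note that this matches whole tokens, while the code in the question uses String.contains, which also matches substrings (the word "cat" would match a description containing "category"). The keys are case-sensitive here; lower-casing each token before the computeIfAbsent call would merge "Love" and "love".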