词频计数Java 8
Word frequency count Java 8
如何统计Java8中List的词频?
List <String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");
结果必须是:
{ciao=2, hello=1, bye=2}
我想分享我找到的解决方案,因为起初我希望使用 map-and-reduce 方法,但它有点不同。
Map<String,Long> collect = wordsList.stream()
.collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ));
或者对于整数值:
Map<String,Integer> collect = wordsList.stream()
.collect( Collectors.groupingBy( Function.identity(), Collectors.summingInt(e -> 1) ));
编辑
我添加了如何按值对地图进行排序:
LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
.stream()
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
.collect(Collectors.toMap(
Map.Entry::getKey,
Map.Entry::getValue,
(v1, v2) -> {
throw new IllegalStateException();
},
LinkedHashMap::new
));
(注意:请参阅下面的编辑)
作为 的替代方法,这里有一种并行计算字数的方法:
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class ParallelWordCount
{
public static void main(String[] args)
{
List<String> list = Arrays.asList(
"hello", "bye", "ciao", "bye", "ciao");
Map<String, Integer> counts = list.parallelStream().
collect(Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
System.out.println(counts);
}
}
EDIT In response to the comment, I ran a small test with JMH, comparing the toConcurrentMap
and the groupingByConcurrent
approach, with different input list sizes and random words of different lengths. This test suggested that the toConcurrentMap
approach was faster. When considering how different these approaches are "under the hood", it's hard to predict something like this.
As a further extension, based on further comments, I extended the test to cover all four combinations of toMap
, groupingBy
, serial and parallel.
The results are still that the toMap
approach is faster, but unexpectedly (at least, for me) the "concurrent" versions in both cases are slower than the serial versions...:
(method) (count) (wordLength) Mode Cnt Score Error Units
toConcurrentMap 1000 2 avgt 50 146,636 ± 0,880 us/op
toConcurrentMap 1000 5 avgt 50 272,762 ± 1,232 us/op
toConcurrentMap 1000 10 avgt 50 271,121 ± 1,125 us/op
toMap 1000 2 avgt 50 44,396 ± 0,541 us/op
toMap 1000 5 avgt 50 46,938 ± 0,872 us/op
toMap 1000 10 avgt 50 46,180 ± 0,557 us/op
groupingBy 1000 2 avgt 50 46,797 ± 1,181 us/op
groupingBy 1000 5 avgt 50 68,992 ± 1,537 us/op
groupingBy 1000 10 avgt 50 68,636 ± 1,349 us/op
groupingByConcurrent 1000 2 avgt 50 231,458 ± 0,658 us/op
groupingByConcurrent 1000 5 avgt 50 438,975 ± 1,591 us/op
groupingByConcurrent 1000 10 avgt 50 437,765 ± 1,139 us/op
toConcurrentMap 10000 2 avgt 50 712,113 ± 6,340 us/op
toConcurrentMap 10000 5 avgt 50 1809,356 ± 9,344 us/op
toConcurrentMap 10000 10 avgt 50 1813,814 ± 16,190 us/op
toMap 10000 2 avgt 50 341,004 ± 16,074 us/op
toMap 10000 5 avgt 50 535,122 ± 24,674 us/op
toMap 10000 10 avgt 50 511,186 ± 3,444 us/op
groupingBy 10000 2 avgt 50 340,984 ± 6,235 us/op
groupingBy 10000 5 avgt 50 708,553 ± 6,369 us/op
groupingBy 10000 10 avgt 50 712,858 ± 10,248 us/op
groupingByConcurrent 10000 2 avgt 50 901,842 ± 8,685 us/op
groupingByConcurrent 10000 5 avgt 50 3762,478 ± 21,408 us/op
groupingByConcurrent 10000 10 avgt 50 3795,530 ± 32,096 us/op
我对JMH不是很熟悉,可能是我这里做错了,欢迎指正:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
@State(Scope.Thread)
public class ParallelWordCount
{
@Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
public String method;
@Param({"2", "5", "10"})
public int wordLength;
@Param({"1000", "10000" })
public int count;
private List<String> list;
@Setup
public void initList()
{
list = createRandomStrings(count, wordLength, new Random(0));
}
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testMethod(Blackhole bh)
{
if (method.equals("toMap"))
{
Map<String, Integer> counts =
list.stream().collect(
Collectors.toMap(
w -> w, w -> 1, Integer::sum));
bh.consume(counts);
}
else if (method.equals("toConcurrentMap"))
{
Map<String, Integer> counts =
list.parallelStream().collect(
Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
bh.consume(counts);
}
else if (method.equals("groupingBy"))
{
Map<String, Long> counts =
list.stream().collect(
Collectors.groupingBy(
Function.identity(), Collectors.<String>counting()));
bh.consume(counts);
}
else if (method.equals("groupingByConcurrent"))
{
Map<String, Long> counts =
list.parallelStream().collect(
Collectors.groupingByConcurrent(
Function.identity(), Collectors.<String> counting()));
bh.consume(counts);
}
}
private static String createRandomString(int length, Random random)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++)
{
int c = random.nextInt(26);
sb.append((char) (c + 'a'));
}
return sb.toString();
}
private static List<String> createRandomStrings(
int count, int length, Random random)
{
List<String> list = new ArrayList<String>(count);
for (int i = 0; i < count; i++)
{
list.add(createRandomString(length, random));
}
return list;
}
}
仅对于具有 10000 个元素和 2 个字母的单词的列表的连续情况,时间相似。
可能值得检查对于更大的列表大小,并发版本最终是否优于串行版本,但目前没有时间针对所有这些配置进行另一个详细的基准测试 运行。
如果你使用Eclipse Collections, you can just convert the List
to a Bag
.
Bag<String> words =
Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();
Assert.assertEquals(2, words.occurrencesOf("ciao"));
Assert.assertEquals(1, words.occurrencesOf("hello"));
Assert.assertEquals(2, words.occurrencesOf("bye"));
您也可以直接使用 Bags
工厂 class 创建 Bag
。
Bag<String> words =
Bags.mutable.with("hello", "bye", "ciao", "bye", "ciao");
此代码适用于 Java 5+。
注意:我是 Eclipse Collections 的提交者
我将在此处展示我所做的解决方案(带分组的解决方案要好得多:))。
static private void test0(List<String> input) {
Set<String> set = input.stream()
.collect(Collectors.toSet());
set.stream()
.collect(Collectors.toMap(Function.identity(),
str -> Collections.frequency(input, str)));
}
只有我的 0.02 美元
我的另外 2 美分,给定一个数组:
import static java.util.stream.Collectors.*;
String[] str = {"hello", "bye", "ciao", "bye", "ciao"};
Map<String, Integer> collected
= Arrays.stream(str)
.collect(groupingBy(Function.identity(),
collectingAndThen(counting(), Long::intValue)));
这是一种使用地图函数创建频率地图的方法。
List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList());
Map<String, Integer> frequencyMap = new HashMap<>();
words.forEach(word ->
frequencyMap.merge(word, 1, (v, newV) -> v + newV)
);
System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}
或者
words.forEach(word ->
frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1)
);
使用泛型查找集合中出现次数最多的项目:
private <V> V findMostFrequentItem(final Collection<V> items)
{
return items.stream()
.filter(Objects::nonNull)
.collect(Collectors.groupingBy(Functions.identity(), Collectors.counting()))
.entrySet()
.stream()
.max(Comparator.comparing(Entry::getValue))
.map(Entry::getKey)
.orElse(null);
}
计算项目频率:
private <V> Map<V, Long> findFrequencies(final Collection<V> items)
{
return items.stream()
.filter(Objects::nonNull)
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
public class Main {
public static void main(String[] args) {
String testString ="qqwweerrttyyaaaaaasdfasafsdfadsfadsewfywqtedywqtdfewyfdweytfdywfdyrewfdyewrefdyewdyfwhxvsahxvfwytfx";
long java8Case2 = testString.codePoints().filter(ch -> ch =='a').count();
System.out.println(java8Case2);
ArrayList<Character> list = new ArrayList<Character>();
for (char c : testString.toCharArray()) {
list.add(c);
}
Map<Object, Integer> counts = list.parallelStream().
collect(Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
System.out.println(counts);
}
}
您可以使用 Java 8 条流
Arrays.asList(s).stream()
.collect(Collectors.groupingBy(Function.<String>identity(),
Collectors.<String>counting()));
Word Count in Java 8
1.Split all the words (w -> w.split("\s+"))
2.Use Collectors.toMap() method to accumulates elements into a Map.
3.Define keyMapper function w -> w.toLowerCase() for toMap() method.
4.Define valueMapper function w -> 1 for toMap() method.
5.Define mergeFunction Integer::sum for toMap() method.
- Print the result in the form of keys and values(Map<String, Integer>)
String str = "java test string test java string";
List<String> wordsList = Stream.of(str).map(w->w.split("//s+")).flatMap(Arrays::stream).collect(Collectors.toList());
wordsList.stream().collect(Collectors.toMap(w->w.toLowerCase(), w->1, Integer::sum)).entrySet().stream().forEach(stringIntegerEntry -> {
System.out.println(stringIntegerEntry.getKey() +" : "+ stringIntegerEntry.getValue());
});
如何统计Java8中List的词频?
List <String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");
结果必须是:
{ciao=2, hello=1, bye=2}
我想分享我找到的解决方案,因为起初我希望使用 map-and-reduce 方法,但它有点不同。
Map<String,Long> collect = wordsList.stream()
.collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ));
或者对于整数值:
Map<String,Integer> collect = wordsList.stream()
.collect( Collectors.groupingBy( Function.identity(), Collectors.summingInt(e -> 1) ));
编辑
我添加了如何按值对地图进行排序:
LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
.stream()
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
.collect(Collectors.toMap(
Map.Entry::getKey,
Map.Entry::getValue,
(v1, v2) -> {
throw new IllegalStateException();
},
LinkedHashMap::new
));
(注意:请参阅下面的编辑)
作为
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class ParallelWordCount
{
public static void main(String[] args)
{
List<String> list = Arrays.asList(
"hello", "bye", "ciao", "bye", "ciao");
Map<String, Integer> counts = list.parallelStream().
collect(Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
System.out.println(counts);
}
}
EDIT In response to the comment, I ran a small test with JMH, comparing the
toConcurrentMap
and thegroupingByConcurrent
approach, with different input list sizes and random words of different lengths. This test suggested that thetoConcurrentMap
approach was faster. When considering how different these approaches are "under the hood", it's hard to predict something like this.As a further extension, based on further comments, I extended the test to cover all four combinations of
toMap
,groupingBy
, serial and parallel.The results are still that the
toMap
approach is faster, but unexpectedly (at least, for me) the "concurrent" versions in both cases are slower than the serial versions...:
(method) (count) (wordLength) Mode Cnt Score Error Units
toConcurrentMap 1000 2 avgt 50 146,636 ± 0,880 us/op
toConcurrentMap 1000 5 avgt 50 272,762 ± 1,232 us/op
toConcurrentMap 1000 10 avgt 50 271,121 ± 1,125 us/op
toMap 1000 2 avgt 50 44,396 ± 0,541 us/op
toMap 1000 5 avgt 50 46,938 ± 0,872 us/op
toMap 1000 10 avgt 50 46,180 ± 0,557 us/op
groupingBy 1000 2 avgt 50 46,797 ± 1,181 us/op
groupingBy 1000 5 avgt 50 68,992 ± 1,537 us/op
groupingBy 1000 10 avgt 50 68,636 ± 1,349 us/op
groupingByConcurrent 1000 2 avgt 50 231,458 ± 0,658 us/op
groupingByConcurrent 1000 5 avgt 50 438,975 ± 1,591 us/op
groupingByConcurrent 1000 10 avgt 50 437,765 ± 1,139 us/op
toConcurrentMap 10000 2 avgt 50 712,113 ± 6,340 us/op
toConcurrentMap 10000 5 avgt 50 1809,356 ± 9,344 us/op
toConcurrentMap 10000 10 avgt 50 1813,814 ± 16,190 us/op
toMap 10000 2 avgt 50 341,004 ± 16,074 us/op
toMap 10000 5 avgt 50 535,122 ± 24,674 us/op
toMap 10000 10 avgt 50 511,186 ± 3,444 us/op
groupingBy 10000 2 avgt 50 340,984 ± 6,235 us/op
groupingBy 10000 5 avgt 50 708,553 ± 6,369 us/op
groupingBy 10000 10 avgt 50 712,858 ± 10,248 us/op
groupingByConcurrent 10000 2 avgt 50 901,842 ± 8,685 us/op
groupingByConcurrent 10000 5 avgt 50 3762,478 ± 21,408 us/op
groupingByConcurrent 10000 10 avgt 50 3795,530 ± 32,096 us/op
我对JMH不是很熟悉,可能是我这里做错了,欢迎指正:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
@State(Scope.Thread)
public class ParallelWordCount
{
@Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
public String method;
@Param({"2", "5", "10"})
public int wordLength;
@Param({"1000", "10000" })
public int count;
private List<String> list;
@Setup
public void initList()
{
list = createRandomStrings(count, wordLength, new Random(0));
}
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testMethod(Blackhole bh)
{
if (method.equals("toMap"))
{
Map<String, Integer> counts =
list.stream().collect(
Collectors.toMap(
w -> w, w -> 1, Integer::sum));
bh.consume(counts);
}
else if (method.equals("toConcurrentMap"))
{
Map<String, Integer> counts =
list.parallelStream().collect(
Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
bh.consume(counts);
}
else if (method.equals("groupingBy"))
{
Map<String, Long> counts =
list.stream().collect(
Collectors.groupingBy(
Function.identity(), Collectors.<String>counting()));
bh.consume(counts);
}
else if (method.equals("groupingByConcurrent"))
{
Map<String, Long> counts =
list.parallelStream().collect(
Collectors.groupingByConcurrent(
Function.identity(), Collectors.<String> counting()));
bh.consume(counts);
}
}
private static String createRandomString(int length, Random random)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++)
{
int c = random.nextInt(26);
sb.append((char) (c + 'a'));
}
return sb.toString();
}
private static List<String> createRandomStrings(
int count, int length, Random random)
{
List<String> list = new ArrayList<String>(count);
for (int i = 0; i < count; i++)
{
list.add(createRandomString(length, random));
}
return list;
}
}
仅对于具有 10000 个元素和 2 个字母的单词的列表的连续情况,时间相似。
可能值得检查对于更大的列表大小,并发版本最终是否优于串行版本,但目前没有时间针对所有这些配置进行另一个详细的基准测试 运行。
如果你使用Eclipse Collections, you can just convert the List
to a Bag
.
Bag<String> words =
Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();
Assert.assertEquals(2, words.occurrencesOf("ciao"));
Assert.assertEquals(1, words.occurrencesOf("hello"));
Assert.assertEquals(2, words.occurrencesOf("bye"));
您也可以直接使用 Bags
工厂 class 创建 Bag
。
Bag<String> words =
Bags.mutable.with("hello", "bye", "ciao", "bye", "ciao");
此代码适用于 Java 5+。
注意:我是 Eclipse Collections 的提交者
我将在此处展示我所做的解决方案(带分组的解决方案要好得多:))。
static private void test0(List<String> input) {
Set<String> set = input.stream()
.collect(Collectors.toSet());
set.stream()
.collect(Collectors.toMap(Function.identity(),
str -> Collections.frequency(input, str)));
}
只有我的 0.02 美元
我的另外 2 美分,给定一个数组:
import static java.util.stream.Collectors.*;
String[] str = {"hello", "bye", "ciao", "bye", "ciao"};
Map<String, Integer> collected
= Arrays.stream(str)
.collect(groupingBy(Function.identity(),
collectingAndThen(counting(), Long::intValue)));
这是一种使用地图函数创建频率地图的方法。
List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList());
Map<String, Integer> frequencyMap = new HashMap<>();
words.forEach(word ->
frequencyMap.merge(word, 1, (v, newV) -> v + newV)
);
System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}
或者
words.forEach(word ->
frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1)
);
使用泛型查找集合中出现次数最多的项目:
private <V> V findMostFrequentItem(final Collection<V> items)
{
return items.stream()
.filter(Objects::nonNull)
.collect(Collectors.groupingBy(Functions.identity(), Collectors.counting()))
.entrySet()
.stream()
.max(Comparator.comparing(Entry::getValue))
.map(Entry::getKey)
.orElse(null);
}
计算项目频率:
private <V> Map<V, Long> findFrequencies(final Collection<V> items)
{
return items.stream()
.filter(Objects::nonNull)
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
public class Main {
public static void main(String[] args) {
String testString ="qqwweerrttyyaaaaaasdfasafsdfadsfadsewfywqtedywqtdfewyfdweytfdywfdyrewfdyewrefdyewdyfwhxvsahxvfwytfx";
long java8Case2 = testString.codePoints().filter(ch -> ch =='a').count();
System.out.println(java8Case2);
ArrayList<Character> list = new ArrayList<Character>();
for (char c : testString.toCharArray()) {
list.add(c);
}
Map<Object, Integer> counts = list.parallelStream().
collect(Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
System.out.println(counts);
}
}
您可以使用 Java 8 条流
Arrays.asList(s).stream()
.collect(Collectors.groupingBy(Function.<String>identity(),
Collectors.<String>counting()));
Word Count in Java 8
1.Split all the words (w -> w.split("\s+"))
2.Use Collectors.toMap() method to accumulates elements into a Map.
3.Define keyMapper function w -> w.toLowerCase() for toMap() method.
4.Define valueMapper function w -> 1 for toMap() method.
5.Define mergeFunction Integer::sum for toMap() method.
- Print the result in the form of keys and values(Map<String, Integer>)
String str = "java test string test java string";
List<String> wordsList = Stream.of(str).map(w->w.split("//s+")).flatMap(Arrays::stream).collect(Collectors.toList());
wordsList.stream().collect(Collectors.toMap(w->w.toLowerCase(), w->1, Integer::sum)).entrySet().stream().forEach(stringIntegerEntry -> {
System.out.println(stringIntegerEntry.getKey() +" : "+ stringIntegerEntry.getValue());
});