String Deduplication feature of Java 8

Since Strings in Java (as in any other language) consume a lot of memory because each character takes two bytes, Java 8 has introduced a feature called String Deduplication, which takes advantage of the fact that the char arrays are internal to Strings and final, so the JVM can mess around with them.

So far I have read this example, but as I am not a professional Java coder, I find it hard to understand the concept.

It says the following:

Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.

My first question:

Since this topic was added only recently, in Java 8 update 20, there is still a lack of resources on it. Can anyone here share some practical examples of how it helps reduce the memory consumed by Strings in Java?

Edit:

The above link says:

As soon as it finds another String which has the same hash code it compares them char by char

My second question:

If the hash codes of two Strings are the same, then the Strings are already identical, so why compare them char by char once two Strings are found to have the same hash code?

The strategy they describe is simply to reuse the internal character array of one String for the possibly many equal Strings. There is no need for each String to have its own copy if they are equal.

To determine more quickly whether two Strings may be equal, the hash code is used as a first step, since it is a fast way of establishing whether Strings are possibly equal. Hence their statement:

As soon as it finds another String which has the same hash code it compares them char by char

This is done to perform a definite (but slower) equality comparison after the hash code has been used to establish likely equality.

In the end, equal Strings will share one underlying character array.
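
To make the idea concrete, here is a rough, purely illustrative sketch in plain Java of such a deduplication table. It is not the real G1 implementation (which runs inside the JVM during garbage collection, on the Strings' internal char arrays); the class and method names are made up for illustration:

    import java.lang.ref.WeakReference;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Illustrative only: mimics the "hash first, then compare char by char" idea.
    class DedupTable {
        // hash of a char array -> weak references to previously seen arrays
        private final Map<Integer, List<WeakReference<char[]>>> table = new HashMap<>();

        // Returns a canonical char[] equal to 'value', registering it if it is new.
        char[] deduplicate(char[] value) {
            int hash = Arrays.hashCode(value);
            List<WeakReference<char[]>> candidates =
                    table.computeIfAbsent(hash, h -> new ArrayList<>());
            Iterator<WeakReference<char[]>> it = candidates.iterator();
            while (it.hasNext()) {
                char[] existing = it.next().get();
                if (existing == null) {
                    it.remove();                        // referent was garbage collected
                } else if (Arrays.equals(existing, value)) {
                    return existing;                    // same hash AND same chars: reuse it
                }
            }
            candidates.add(new WeakReference<>(value)); // first array seen with this content
            return value;
        }
    }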

Java has had String.intern() for a long time, which does more or less the same thing (namely saving memory by eliminating duplicate Strings). What is new here is that it happens during garbage collection and can be controlled from the outside.
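
As a small illustration of the difference (the == results for intern() follow from the String.intern() contract; the effect of deduplication itself cannot be observed from Java code, so the comment about it is only descriptive):

    public class InternVsDedup {
        public static void main(String[] args) {
            String a = new String("John"); // forces a distinct String object
            String b = new String("John");

            System.out.println(a == b);                   // false: two different objects
            System.out.println(a.intern() == b.intern()); // true: intern() returns one canonical String

            // With -XX:+UseG1GC -XX:+UseStringDeduplication, a and b stay distinct
            // objects (a == b remains false), but after enough GC cycles they may
            // silently share the same backing char array inside the JVM.
        }
    }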

Suppose you have a phone book containing persons, each of which has a String firstName and a String lastName. It so happens that in your phone book, 100,000 people share the same firstName = "John".

Because you get the data from a database or a file, your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100,000 times, one per John String. Each of these arrays takes, say, 20 bytes of memory, so those 100k Johns occupy 2 MB of memory.

With deduplication, the JVM will realize that "John" is duplicated many times and make all of those John Strings point to the same underlying char array, reducing the memory usage from 2 MB to 20 bytes.
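
A sketch of that phone book scenario (the Person class and the loop are mine, not from the original answer; the byte figures are the rough approximations used above):

    import java.util.ArrayList;
    import java.util.List;

    public class PhoneBook {
        static class Person {
            final String firstName;
            final String lastName;
            Person(String firstName, String lastName) {
                this.firstName = firstName;
                this.lastName = lastName;
            }
        }

        public static void main(String[] args) {
            List<Person> phoneBook = new ArrayList<>();
            for (int i = 0; i < 100_000; i++) {
                // Reading from a file or database yields a fresh String each time,
                // so each Person initially carries its own {'J','o','h','n'} array.
                phoneBook.add(new Person(new String("John"), "Smith" + i));
            }
            // Without deduplication: ~100,000 copies of the same array (~2 MB).
            // With -XX:+UseG1GC -XX:+UseStringDeduplication the GC eventually lets
            // all firstName Strings share a single backing array (~20 bytes).
            System.out.println(phoneBook.size() + " entries loaded");
        }
    }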

You can find a more detailed explanation in the JEP. In particular:

Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.

[...]

The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.

Since your first question has already been answered, I will answer your second one.

String objects have to be compared char by char because, while equal Objects imply equal hash codes, the reverse is not necessarily true.

As Holger said in his , this is referred to as a hash collision.

The following specification applies to the hashCode() method:

  • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

  • It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. ...

That is to say, in order to guarantee equality, every single character has to be compared before the equality of the two objects can be confirmed. They compare the hashCode first, instead of using equals right away, because they use a hash table for the lookup, which improves performance.
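
You can see such a collision yourself: "Aa" and "BB" happen to produce the same hash code under String's hash function, yet they are clearly not equal:

    public class HashCollision {
        public static void main(String[] args) {
            String s1 = "Aa";
            String s2 = "BB";

            System.out.println(s1.hashCode()); // 2112
            System.out.println(s2.hashCode()); // 2112  -> same hash code...
            System.out.println(s1.equals(s2)); // false -> ...but not equal, which is why
            // the deduplication code still has to compare the arrays char by char.
        }
    }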

@assylias's answer above, which basically tells you how it works, is a very good answer. I have tested String Deduplication with a production application and got some results. The web application uses a lot of Strings, so I think the advantage is quite noticeable.

To enable String Deduplication you must add these JVM arguments (you need at least Java 8u20):

-XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics

The last one is optional, but as its name suggests, it shows you the String Deduplication statistics. Here are mine:

[GC concurrent-string-deduplication, 2893.3K->2672.0B(2890.7K), avg 97.3%, 0.0175148 secs]
   [Last Exec: 0.0175148 secs, Idle: 3.2029081 secs, Blocked: 0/0.0000000 secs]
      [Inspected:           96613]
         [Skipped:              0(  0.0%)]
         [Hashed:           96598(100.0%)]
         [Known:                2(  0.0%)]
         [New:              96611(100.0%)   2893.3K]
      [Deduplicated:        96536( 99.9%)   2890.7K( 99.9%)]
         [Young:                0(  0.0%)      0.0B(  0.0%)]
         [Old:              96536(100.0%)   2890.7K(100.0%)]
   [Total Exec: 452/7.6109490 secs, Idle: 452/776.3032184 secs, Blocked: 11/0.0258406 secs]
      [Inspected:        27108398]
         [Skipped:              0(  0.0%)]
         [Hashed:        26828486( 99.0%)]
         [Known:            19025(  0.1%)]
         [New:           27089373( 99.9%)    823.9M]
      [Deduplicated:     26853964( 99.1%)    801.6M( 97.3%)]
         [Young:             4732(  0.0%)    171.3K(  0.0%)]
         [Old:           26849232(100.0%)    801.4M(100.0%)]
   [Table]
      [Memory Usage: 2834.7K]
      [Size: 65536, Min: 1024, Max: 16777216]
      [Entries: 98687, Load: 150.6%, Cached: 415, Added: 252375, Removed: 153688]
      [Resize Count: 6, Shrink Threshold: 43690(66.7%), Grow Threshold: 131072(200.0%)]
      [Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0]
      [Age Threshold: 3]
   [Queue]
      [Dropped: 0]

These are the results after the application had been running for 10 minutes. As you can see, String Deduplication was executed 452 times and "deduplicated" 801.6 MB of Strings. String Deduplication inspected 27,000,000 Strings. When I compared the memory consumption of Java 7 with the standard Parallel GC against Java 8u20 with the G1 GC and String Deduplication enabled, the heap dropped by about 50%:

Java 7 Parallel GC

Java 8 G1 GC with String Deduplication
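
If you want to reproduce similar statistics, a throwaway program along these lines (my own sketch, not part of the original answer) creates plenty of duplicate Strings and keeps them alive long enough for the concurrent deduplication thread to process them; run it with the flags listed above:

    import java.util.ArrayList;
    import java.util.List;

    // Run with: java -XX:+UseG1GC -XX:+UseStringDeduplication
    //                -XX:+PrintStringDeduplicationStatistics DedupDemo
    public class DedupDemo {
        public static void main(String[] args) throws InterruptedException {
            List<String> strings = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {
                // new String(...) guarantees distinct String objects with identical
                // content, which is exactly what deduplication targets.
                strings.add(new String("some fairly long duplicated value " + (i % 10)));
            }
            for (int i = 0; i < 5; i++) {
                System.gc();         // encourage GC cycles so the strings reach the age threshold
                Thread.sleep(1_000); // give the deduplication thread time to run
            }
            System.out.println("Still holding " + strings.size() + " strings");
        }
    }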