Difference between String trim() and strip() methods in Java 11

Among other changes, JDK 11 introduces 6 new methods for the java.lang.String class.

In particular, strip() looks very similar to trim(). According to this article, the strip*() methods are designed to:

The String.strip(), String.stripLeading(), and String.stripTrailing() methods trim white space [as determined by Character.isWhiteSpace()] off either the front, back, or both front and back of the targeted String.

The String.trim() Javadoc states:

/**
  * Returns a string whose value is this string, with any leading and trailing
  * whitespace removed.
  * ...
  */

This is almost identical to the quote above.

As of Java 11, what exactly is the difference between String.trim() and String.strip()?

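In short: strip() is the "Unicode-aware" evolution of trim(). That is, trim() removes only characters <= U+0020 (space); strip() removes all Unicode whitespace characters, as determined by Character.isWhitespace (but not all control characters, such as \0).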
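A minimal sketch of that summary, assuming a Java 11+ runtime. The class name TrimVsStripDemo and the helper compare() are made up for illustration; the lengths make leftover invisible characters easy to spot.

```java
public class TrimVsStripDemo {
    // Hypothetical helper: reports the length left after each method.
    static String compare(String s) {
        return "trim=" + s.trim().length() + " strip=" + s.strip().length();
    }

    public static void main(String[] args) {
        // U+0000 is <= U+0020 but is not Unicode whitespace:
        // trim() removes it, strip() keeps it.
        System.out.println(compare("\0abc\0"));         // trim=3 strip=5

        // U+2000 (EN QUAD) is Unicode whitespace but is > U+0020:
        // strip() removes it, trim() keeps it.
        System.out.println(compare("\u2000abc\u2000")); // trim=5 strip=3
    }
}
```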

CSR : JDK-8200378

Problem

String::trim has existed from the early days of Java, when Unicode had not fully evolved to the standard we widely use today.

The definition of space used by String::trim is any code point less than or equal to the space code point (\u0020), commonly referred to as ASCII or ISO control characters.

Unicode-aware trimming routines should use Character::isWhitespace(int).

Additionally, developers have not been able to specifically remove indentation white space or to specifically remove trailing white space.

Solution

Introduce trimming methods that are Unicode white space aware and provide additional control of leading only or trailing only.
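As a rough illustration of the "leading only or trailing only" control mentioned in the CSR, the sketch below calls the three Java 11 methods on the same input. The class name StripVariantsDemo and the helper variants() are hypothetical.

```java
public class StripVariantsDemo {
    // Hypothetical helper: returns {stripLeading, stripTrailing, strip} results.
    static String[] variants(String s) {
        return new String[] { s.stripLeading(), s.stripTrailing(), s.strip() };
    }

    public static void main(String[] args) {
        // Brackets make the remaining spaces visible.
        for (String v : variants("  abc  ")) {
            System.out.println("[" + v + "]");
        }
        // [abc  ]
        // [  abc]
        // [abc]
    }
}
```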

A common characteristic of these new methods is that they use a different (newer) definition of "whitespace" than older methods such as String.trim(). Bug JDK-8200373:

The current JavaDoc for String::trim does not make it clear which definition of "space" is being used in the code. With additional trimming methods coming in the near future that use a different definition of space, clarification is imperative. String::trim uses the definition of space as any codepoint that is less than or equal to the space character codepoint (\u0020.) Newer trimming methods will use the definition of (white) space as any codepoint that returns true when passed to the Character::isWhitespace predicate.

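The method isWhitespace(char) was added to Character in JDK 1.1, but the method isWhitespace(int) was not introduced to the Character class until JDK 1.5. The latter method (the one that accepts a parameter of type int) was added to support supplementary characters. The Javadoc comments for the Character class define supplementary characters (typically modeled with an int-based "code point") versus BMP characters (typically modeled with a single char):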

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values ... A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. ... The methods that only accept a char value cannot support supplementary characters. ... The methods that accept an int value support all Unicode characters, including supplementary characters.
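The char versus int distinction can be seen with a short sketch. It uses only standard Character methods; the class name CodePointDemo, the helper charsNeeded(), and the choice of U+1F600 as a sample supplementary code point are illustrative assumptions.

```java
public class CodePointDemo {
    // Hypothetical helper: how many UTF-16 char values a code point occupies.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int emSpace = 0x2003;   // BMP code point (EM SPACE)
        int emoji = 0x1F600;    // supplementary code point, chosen as an example

        System.out.println(charsNeeded(emSpace)); // 1
        System.out.println(charsNeeded(emoji));   // 2 (a surrogate pair)

        // The int overload accepts any code point, BMP or supplementary.
        System.out.println(Character.isWhitespace(emSpace)); // true
        System.out.println(Character.isWhitespace(emoji));   // false

        // A String holding one supplementary character has length 2.
        String s = new String(Character.toChars(emoji));
        System.out.println(s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code point");
    }
}
```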

OpenJDK changeset.


Benchmark comparison between trim() and strip() -

Here is a unit test that illustrates @MikhailKholodkov's answer, using Java 11.

(Note that \u2000 is above \u0020 and is not considered whitespace by trim())

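// Imports assume JUnit 4 on the classpath; the original snippet omitted them.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class StringTestCase {
    // ASCII whitespace only: trim() and strip() give the same result.
    @Test
    public void testSame() {
        String s = "\t abc \n";

        assertEquals("abc", s.trim());
        assertEquals("abc", s.strip());
    }

    // U+2000 is Unicode whitespace above U+0020: only strip() removes it.
    @Test
    public void testDifferent() {
        Character c = '\u2000';
        String s = c + "abc" + c;

        assertTrue(Character.isWhitespace(c));
        assertEquals(s, s.trim());
        assertEquals("abc", s.strip());
    }
}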

In general, both methods remove leading and trailing whitespace from a string. The difference appears when we work with Unicode characters or multilingual text.

trim() removes all leading and trailing characters whose code point is less than or equal to 32 (U+0020, the space character).
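The rule above can be sketched as a small loop. This is a hypothetical re-implementation for illustration only, not the JDK source; the names TrimSketch and trimLike are made up.

```java
public class TrimSketch {
    // Illustrative only: drops leading and trailing chars that are <= U+0020.
    static String trimLike(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && s.charAt(start) <= ' ') {
            start++;
        }
        while (start < end && s.charAt(end - 1) <= ' ') {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String ascii = "\t abc \n";
        String unicode = "\u2001abc\u2001";

        System.out.println(trimLike(ascii).equals(ascii.trim()));     // true
        System.out.println(trimLike(unicode).equals(unicode.trim())); // true
        System.out.println(trimLike(unicode).length());               // 5, U+2001 is kept
    }
}
```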

According to the Unicode standard, there are various space characters with code points greater than 32 (U+0020). For example: 8193 (U+2001).

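To identify these space characters, a new method isWhitespace(int) was added to the Character class in Java 1.5. This method uses Unicode to identify whitespace characters. You can read more about Unicode space characters here.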

The new strip method added in Java 11 uses this Character.isWhitespace(int) method to cover a wide range of whitespace characters and remove them.
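A minimal sketch of that idea, walking the string by code point and testing each end with Character.isWhitespace(int). This is an illustrative approximation rather than the actual JDK implementation, and the names StripSketch and stripLike are hypothetical.

```java
public class StripSketch {
    // Illustrative only: drops leading and trailing code points for which
    // Character.isWhitespace(int) returns true.
    static String stripLike(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!Character.isWhitespace(cp)) break;
            start += Character.charCount(cp);
        }
        while (start < end) {
            int cp = s.codePointBefore(end);
            if (!Character.isWhitespace(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String s = "\u2001abc\u2001";
        System.out.println(stripLike(s).equals(s.strip())); // true
        System.out.println(stripLike(s).length());          // 3

        // Character.isWhitespace is false for the non-breaking space U+00A0,
        // so it is left in place here.
        System.out.println(stripLike("\u00A0abc").length()); // 4
    }
}
```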

Example

public class StringTrimVsStripTest {
    public static void main(String[] args) {
        String string = '\u2001'+"String    with    space"+ '\u2001';
        System.out.println("Before: \"" + string+"\"");
        System.out.println("After trim: \"" + string.trim()+"\"");
        System.out.println("After strip: \"" + string.strip()+"\"");
    }
}

Output

Before: "  String    with    space  "
After trim: " String    with    space "
After strip: "String    with    space"

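Note: If you run this on a Windows machine, you may not see similar output because of the limited Unicode character set. You can try an online compiler to test this code.

An example where strip() and trim() produce different output: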

String s = "test string\u205F";
String stripped = s.strip();
System.out.printf("'%s'%n", stripped);//'test string'

String trimmed = s.trim();
System.out.printf("'%s'%n", trimmed);//'test string '