StringBuilder#appendCodePoint(int) behaves unexpectedly
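The appendCodePoint(...) method of java.lang.StringBuilder behaves in a way I did not expect.
For Unicode code points above Character.MAX_VALUE (which need 4 bytes to encode in UTF-8, the encoding my Eclipse workspace is set to), it behaves strangely.
I append the Unicode code points of a String to a StringBuilder one by one, but the output at the end looks different from the input.
I suspect that the call to Character.toSurrogates(codePoint, value, count) in AbstractStringBuilder#appendCodePoint(...) causes this, but I don't know how to work around it.
My code: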
// returns random string in range of unicode code points 0x2F800 to 0x2FA1F
// e.g.
String s = getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(length);
System.out.println(s);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < getCodePointCount(s); i++) {
    sb.appendCodePoint(s.codePointAt(i));
}
// prints some of the CJK characters, but between them there is a '?'
// e.g. ???????????????
System.out.println(sb.toString());

// returns random string in range of unicode code points 0x20000 to 0x2A6DF
// e.g.
s = getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(length);
// prints the CJK characters correctly
System.out.println(s);
sb = new StringBuilder();
for (int i = 0; i < getCodePointCount(s); i++) {
    sb.appendCodePoint(s.codePointAt(i));
}
// prints some of the CJK characters, but between them there is a '?'
// e.g. ???????????????
System.out.println(sb.toString());
with:
public static int getCodePointCount(String s) {
    return s.codePointCount(0, s.length());
}

public static String getRandomChineseJapaneseKoreanStringExtensionBOfMaxLength(int length) {
    return getRandomStringOfMaxLengthInRange(length, 0x20000, 0x2A6DF);
}

public static String getRandomChineseJapaneseKoreanStringCompatibilitySupplementOfMaxLength(int length) {
    return getRandomStringOfMaxLengthInRange(length, 0x2F800, 0x2FA1F);
}

private static String getRandomStringOfMaxLengthInRange(int length, int from, int to) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < length; i++) {
        // try to find a valid character MAX_TRIES times
        for (int j = 0; j < MAX_TRIES; j++) {
            int unicodeInt = from + random.nextInt(to - from);
            if (Character.isValidCodePoint(unicodeInt) &&
                    (Character.isLetter(unicodeInt) || Character.isDigit(unicodeInt) ||
                    Character.isWhitespace(unicodeInt))) {
                sb.appendCodePoint(unicodeInt);
                break;
            }
        }
    }
    return new String(sb.toString().getBytes(), "UTF-8");
}
You are iterating over the code points incorrectly. You should use the strategy proposed by Jonathan Feinberg:
final int length = s.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = s.codePointAt(offset);
    // do something with the codepoint
    offset += Character.charCount(codepoint);
}
or, since Java 8:
s.codePoints().forEach(/* do something */);
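The Javadoc of String#codePointAt(int) states:
Returns the character (Unicode code point) at the specified index. The
index refers to char values (Unicode code units) and ranges from 0 to
length() - 1.
You are iterating from 0 to codePointCount, but the index you pass to codePointAt is a char index, not a code point index. If the char at that index is not part of a high-low surrogate pair, it is returned on its own, and your index should advance by only 1. Otherwise the index should advance by 2 (Character#charCount(int) handles this), because you are getting the code point that corresponds to the whole pair.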
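As a minimal sketch of the advice above, the following copies a String into a StringBuilder one code point at a time, advancing the char index by Character.charCount. The class name CodePointCopy and the method name copyByCodePoint are made up for illustration and do not come from the question.

```java
final class CodePointCopy {

    /** Rebuilds s by appending each of its code points exactly once. */
    static String copyByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int offset = 0; offset < s.length(); ) {
            int codePoint = s.codePointAt(offset);
            sb.appendCodePoint(codePoint);
            // 1 for a BMP char, 2 for a surrogate pair
            offset += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'a', the supplementary code point 0x2F8EB, then 'b': 4 chars, 3 code points
        String s = "a\ud87e\udceb" + "b";
        System.out.println(copyByCodePoint(s).equals(s)); // true
    }
}
```

The loop bound is s.length() because offset is a char index; the code point count is never needed.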
Change your loop from this:
for (int i = 0; i < getCodePointCount(s); i++) {
to this (note that the bound must also become s.length(), because i is a char index):
for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
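For completeness, here is a small self-contained sketch of a loop that steps with String#offsetByCodePoints. The class OffsetLoop and the method rebuild are hypothetical names; the sketch bounds the loop by s.length() because i is a char index, which is an assumption about what the loop is meant to do rather than something stated in the question.

```java
final class OffsetLoop {

    /** Rebuilds s code point by code point, stepping with offsetByCodePoints. */
    static String rebuild(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        // i is a char index, so it is compared against s.length(),
        // not against the number of code points.
        for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
            sb.appendCodePoint(s.codePointAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\ud87e\udceb\ud87e\udcec"; // code points 0x2F8EB and 0x2F8EC
        System.out.println(rebuild(s).equals(s)); // true
    }
}
```

If the loop were bounded by the code point count instead, a string made only of supplementary code points would be cut off halfway, since it has twice as many chars as code points.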
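In Java, a char is a single UTF-16 value. A supplementary code point takes up two chars in a String.
But you are looping over the individual chars of the String. This means each supplementary code point is read twice: the first time, you read both of its UTF-16 surrogate chars; the second time, you read and append just the low surrogate char.
Consider a String that contains only one code point, 0x2f8eb. A Java String representing that code point actually contains this:
"\ud87e\udceb"
If you loop over each individual char index, your loop effectively does this: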
sb.appendCodePoint(0x2f8eb); // codepoint found at index 0
sb.appendCodePoint(0xdceb); // codepoint found at index 1
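The effect can be reproduced with a short sketch. It mirrors the question's loop on a two-code-point String and compares it with String#codePoints(); the class and method names (SurrogateLoopDemo, visitedByQuestionLoop, visitedByCodePoint) are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SurrogateLoopDemo {

    /** Mirrors the question's loop: i counts code points but is used as a char index. */
    static List<String> visitedByQuestionLoop(String s) {
        List<String> seen = new ArrayList<>();
        int count = s.codePointCount(0, s.length());
        for (int i = 0; i < count; i++) {
            seen.add(Integer.toHexString(s.codePointAt(i)));
        }
        return seen;
    }

    /** Visits each code point exactly once. */
    static List<String> visitedByCodePoint(String s) {
        List<String> seen = new ArrayList<>();
        s.codePoints().forEach(cp -> seen.add(Integer.toHexString(cp)));
        return seen;
    }

    public static void main(String[] args) {
        String s = "\ud87e\udceb\ud87e\udcec"; // code points 0x2F8EB and 0x2F8EC
        System.out.println(visitedByQuestionLoop(s)); // [2f8eb, dceb]
        System.out.println(visitedByCodePoint(s));    // [2f8eb, 2f8ec]
    }
}
```

The lone low surrogate 0xdceb has no character of its own, which would explain the '?' printed between the CJK characters, and the second code point is never reached because the loop stops at the code point count.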