How to get the full list of UTF-8 characters in Java
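I want to add a test suite that runs over the entire Unicode character set. Is there a way to get the full list of Unicode characters? Most online resources discuss how to encode and decode, but I have not found useful material on how to get the full list.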
TL;DR: You probably want to skip to the "Visible code points" section below.
All code points
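Every Unicode character (code point) can be encoded in UTF-8. As Wikipedia puts it:
"UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units."
Unicode contains 1,114,112 code points, ranging from 0 (hex) to 10FFFF (hex). Of these, the 2,048 surrogate code points (D800 to DFFF hex) are reserved for UTF-16 and are not valid characters, which is why Wikipedia counts 1,112,064.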
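The relationship between the two figures can be checked with a few lines of Java. This is only a minimal sketch; the class name CodePointCounts and the method countScalarValues are made up for illustration, while the Character constants are standard JDK fields.

```java
class CodePointCounts {
    /** Counts the code points in 0..0x10FFFF that are not surrogates. */
    static int countScalarValues() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean surrogate = cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            if (!surrogate) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int total = Character.MAX_CODE_POINT + 1;                               // 0x110000
        int surrogates = Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1; // D800..DFFF
        System.out.printf("All code points: %d%n", total);
        System.out.printf("Surrogates     : %d%n", surrogates);
        System.out.printf("Valid for UTF-8: %d%n", countScalarValues());
    }
}
```

The last line prints 1112064, that is, 1,114,112 minus the 2,048 surrogates.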
So, to get all UTF-8 characters:
// Requires: import java.nio.charset.StandardCharsets;

// Build a string containing every Unicode code point
int[] codePoints = new int[0x110000]; // 0 - 0x10FFFF
for (int i = 0; i < codePoints.length; i++) {
    codePoints[i] = i;
}
String allChars = new String(codePoints, 0, codePoints.length);

// Convert to UTF-8 (unpaired surrogates cannot be encoded and become '?')
byte[] allUtf8Sequences = allChars.getBytes(StandardCharsets.UTF_8);

// Print statistics
System.out.printf("Code points: %d = 0x%1$x%n", codePoints.length);
System.out.printf("Java chars : %d = 0x%1$x%n", allChars.length());
System.out.printf(" Surrogate pairs: %d = 0x%1$x%n", allChars.length() - codePoints.length);
System.out.printf("UTF-8 bytes: %d = 0x%1$x%n", allUtf8Sequences.length);
System.out.printf(" Average bytes per code point: %.2f%n", (double) allUtf8Sequences.length / codePoints.length);
Output
Code points: 1114112 = 0x110000
Java chars : 2162688 = 0x210000
Surrogate pairs: 1048576 = 0x100000
UTF-8 bytes: 4384642 = 0x42e782
Average bytes per code point: 3.94
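To see where the byte count comes from, here is a rough sketch that computes the UTF-8 length of each code point from its range and shows what happens to a lone surrogate. The class Utf8Lengths and its methods are illustrative names, not part of any library. It relies on the documented behaviour of String.getBytes(Charset), which replaces input it cannot encode with the charset's replacement bytes ('?' for UTF-8).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Lengths {
    /** UTF-8 length in bytes of a non-surrogate code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // ASCII
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;  // rest of the Basic Multilingual Plane
        return 4;                           // supplementary planes
    }

    /** Encodes a single code point the same way the snippet above encodes the whole string. */
    static byte[] encode(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    /** Sums the UTF-8 lengths of all code points except the surrogates. */
    static long totalScalarBytes() {
        long total = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            boolean surrogate = cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            if (!surrogate) {
                total += utf8Length(cp);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 bytes without surrogates: " + totalScalarBytes());
        System.out.println("Lone U+D800 encodes as: " + Arrays.toString(encode(0xD800)));
    }
}
```

The total without surrogates is 4,382,592 bytes. The remaining 2,050 bytes in the output above come from the surrogate range: in the all-code-points string, U+DBFF is immediately followed by U+DC00, so those two form one valid pair (4 bytes), and the other 2,046 unpaired surrogates are each replaced by a single '?' byte.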
Visible code points
Note that not all code points are currently defined by Unicode. To restrict the list to defined characters, use Character.isDefined(codePoint).
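As a minimal sketch of that filter, the defined code points can be collected with an IntStream. The class and method names here are illustrative assumptions; only Character.isDefined and the stream API come from the JDK. The resulting count depends on which Unicode version the running JDK supports, so no expected number is given.

```java
import java.util.stream.IntStream;

class DefinedCodePoints {
    /** Returns every code point the running JDK reports as defined. */
    static int[] definedCodePoints() {
        return IntStream.rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
                .filter(Character::isDefined)
                .toArray();
    }

    public static void main(String[] args) {
        int[] defined = definedCodePoints();
        System.out.printf("Defined code points: %d of %d%n",
                defined.length, Character.MAX_CODE_POINT + 1);
    }
}
```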
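You probably also want to skip control characters and whitespace characters. To skip all of those and check only visible characters, we can use Character.getType(codePoint):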
// Requires: import java.nio.charset.StandardCharsets;

// Build a string containing only the visible Unicode characters
int[] codePoints = new int[Character.MAX_CODE_POINT + 1];
int count = 0;
for (int codePoint = 0; codePoint < codePoints.length; codePoint++) {
    switch (Character.getType(codePoint)) {
        case Character.UNASSIGNED:          // Cn
        case Character.CONTROL:             // Cc
        case Character.FORMAT:              // Cf
        case Character.PRIVATE_USE:         // Co
        case Character.SURROGATE:           // Cs
        case Character.SPACE_SEPARATOR:     // Zs
        case Character.LINE_SEPARATOR:      // Zl
        case Character.PARAGRAPH_SEPARATOR: // Zp
            break; // Skip
        default:
            codePoints[count++] = codePoint;
    }
}
String chars = new String(codePoints, 0, count);

// Convert to UTF-8
byte[] utf8bytes = chars.getBytes(StandardCharsets.UTF_8);

// Print statistics
System.out.printf("Code points: %d = 0x%1$x%n", count);
System.out.printf("Java chars : %d = 0x%1$x%n", chars.length());
System.out.printf(" Surrogate pairs: %d = 0x%1$x%n", chars.length() - count);
System.out.printf("UTF-8 bytes: %d = 0x%1$x%n", utf8bytes.length);
System.out.printf(" Average bytes per code point: %.2f%n", (double) utf8bytes.length / count);
Output (the exact counts depend on the Unicode version supported by the Java runtime)
Code points: 143679 = 0x2313f
Java chars : 231980 = 0x38a2c
Surrogate pairs: 88301 = 0x158ed
UTF-8 bytes: 517331 = 0x7e4d3
Average bytes per code point: 3.60
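For a test suite, it may be more convenient to iterate over the visible code points one at a time rather than building one large string. The sketch below applies the same getType filter as the switch above, exposes the result as an IntStream, and runs a simple UTF-8 round-trip check on each code point. The names VisibleCodePoints, isVisible, visibleCodePoints and roundTrips are hypothetical, and the round-trip check merely stands in for whatever the real test needs to verify.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.IntStream;

class VisibleCodePoints {
    /** Same filter as the switch above: assigned, not control/format/private-use/surrogate/separator. */
    static boolean isVisible(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UNASSIGNED:
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.PRIVATE_USE:
            case Character.SURROGATE:
            case Character.SPACE_SEPARATOR:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
                return false;
            default:
                return true;
        }
    }

    /** Lazily streams every visible code point in ascending order. */
    static IntStream visibleCodePoints() {
        return IntStream.rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
                .filter(VisibleCodePoints::isVisible);
    }

    /** Example check: encoding to UTF-8 and decoding again yields the same string. */
    static boolean roundTrips(int codePoint) {
        String original = new String(Character.toChars(codePoint));
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).equals(original);
    }

    public static void main(String[] args) {
        long failures = visibleCodePoints().filter(cp -> !roundTrips(cp)).count();
        System.out.println("Round-trip failures: " + failures);
    }
}
```

In a JUnit 5 project, the same stream could feed a parameterized test, but that wiring is left out here to keep the sketch free of external dependencies.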