Java Scanner.nextLine() mistakenly parses Unicode (emoji) as a new line

Minimal example:

    String test = "salut ð\u009F\u0098\u0085 test";
    Scanner scan = new Scanner(test);
    System.out.println("1:" + scan.nextLine());
    System.out.println("2:" + scan.nextLine());

This string came from user input, so unfortunately I'm not 100% sure what that Unicode is, but if I remember correctly it's an emoji (I saw the message when it was sent).

The output is:

    1:salut ð
    2: test

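I expected only one line of output (i.e. the sample code should throw a NoSuchElementException, because the second nextLine() should fail). Why is it parsed as two lines? What is a potential workaround?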

When I open the file in a text editor, it correctly does not treat that Unicode as a new line.

Why is it parsing as two lines?

Although it is an uncommon code point, the Unicode name of U+0085 is NEXT LINE (NEL), so I guess it *can* be considered a line break.
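As for where a stray U+0085 might come from: the question does not confirm this, but the test string looks exactly like the UTF-8 bytes of U+1F605 (an emoji, bytes F0 9F 98 85) decoded as ISO-8859-1. The sketch below only checks that guess. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class MojibakeCheck {
    // Encodes the text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1.
    // Each byte becomes one char, so the byte 0x85 turns into the char U+0085 (NEL).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumption: the user typed U+1F605; the question only says "an emoji".
        String emoji = new String(Character.toChars(0x1F605));
        String garbled = misdecode("salut " + emoji + " test");

        // Same literal as in the question, with the first char written as an escape.
        String fromQuestion = "salut \u00F0\u009F\u0098\u0085 test";
        System.out.println(garbled.equals(fromQuestion)); // true
    }
}
```

If that is what happened, decoding the input with the right charset would probably remove the U+0085 altogether, whichever reader is used.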

But is there a reason BufferedReader and text editors like Sublime Text don't parse it as an actual new line, while Scanner does?

If you look at the respective documentation of Scanner and BufferedReader:

Scanner.nextLine:

Advances this scanner past the current line and returns the input that was skipped. This method returns the rest of the current line, excluding any line separator at the end. The position is set to the beginning of the next line.

Since this method continues to search through the input looking for a line separator...

BufferedReader.readLine:

Reads a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.
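To see the difference between the two descriptions in practice, here is a small sketch that feeds the string from the question to both classes and counts the lines each one reports. The class and helper names are hypothetical; only Scanner and BufferedReader themselves come from the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LineSplitComparison {
    // Collects lines the way Scanner.nextLine() sees them.
    static List<String> scannerLines(String input) {
        List<String> lines = new ArrayList<>();
        try (Scanner scan = new Scanner(input)) {
            while (scan.hasNextLine()) {
                lines.add(scan.nextLine());
            }
        }
        return lines;
    }

    // Collects lines the way BufferedReader.readLine() sees them.
    static List<String> readerLines(String input) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory StringReader; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // The string from the question, with the first char written as an escape.
        String test = "salut \u00F0\u009F\u0098\u0085 test";
        System.out.println("Scanner:        " + scannerLines(test).size()); // 2
        System.out.println("BufferedReader: " + readerLines(test).size());  // 1
    }
}
```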

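Scanner.nextLine just says "line separator", which is a very vague term (it certainly doesn't mean the Unicode category "Line Separator", which contains only a single code point), whereas the BufferedReader.readLine documentation states exactly what terminates a line.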

Considering how Scanner handles localized number formats and the like, my guess is that it was designed to be a "smarter" class than BufferedReader.

Looking at the source code of my JDK version, Scanner considers the following strings to be "line separators":

  • \r\n
  • \n
  • \r
  • \u2028
  • \u2029
  • \u0085

The reason why \u0085 is considered a new line character is apparently related to XML parsing.
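As for a workaround: the simplest option is probably to read lines with BufferedReader.readLine instead. If you want to keep using Scanner, one possible sketch is to stop calling nextLine() and instead set a delimiter that matches only the terminators BufferedReader documents, then read tokens with next(). The class name and the STRICT_EOL constant are made up for illustration, and I have not checked every edge case (blank lines, for instance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class StrictLineScanner {
    // Only CR LF, LF, or CR end a line here, mirroring the BufferedReader.readLine docs.
    // U+0085, U+2028 and U+2029 are deliberately left out.
    static final Pattern STRICT_EOL = Pattern.compile("\\r\\n|[\\r\\n]");

    // Splits the input into lines using next() with the strict delimiter
    // instead of nextLine(), which uses Scanner's own built-in separator list.
    static List<String> lines(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scan = new Scanner(input).useDelimiter(STRICT_EOL)) {
            while (scan.hasNext()) {
                result.add(scan.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The string from the question, with the first char written as an escape.
        String test = "salut \u00F0\u009F\u0098\u0085 test";
        System.out.println(lines(test).size());  // 1
        System.out.println(lines("a\nb\r\nc"));  // [a, b, c]
    }
}
```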