Character.isDigit returns false for a digit read from a UTF-8 file

Question

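I am reading a file with a Scanner. What I don't understand is this: if the file is a UTF-8 file and the line currently being read starts with a digit, Character.isDigit(line.charAt(0)) returns false. If the file is not a UTF-8 file, the same call returns true.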

Here is some code:

File theFile = new File(pathToFile);
Scanner fileContent = new Scanner(new FileInputStream(theFile), "UTF-8");
while (fileContent.hasNextLine())
{
    String line = fileContent.nextLine();
    if (Character.isDigit(line.charAt(0)))
    {
        // When the file being read from is NOT a UTF-8 file, we get down here
    }
}

When I inspect the line string in the debugger, it looks the same in both cases (UTF-8 file or not): a number. Why does this happen?

Answer 1

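As the exchange in the comments finally established, your file contains a BOM (byte order mark). A BOM is generally not recommended for UTF-8 files, because Java does not expect it and treats it as data. The decoded first line therefore starts with the invisible character U+FEFF rather than the digit, which is why the string looks identical in the debugger.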
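To make the effect concrete, here is a minimal sketch that decodes the same two digits with and without the three UTF-8 BOM bytes (EF BB BF) in front. The class name BomDemo and the helper decode are made up for illustration, and it decodes a byte array directly instead of going through a Scanner, on the assumption that the Scanner's UTF-8 decoding treats the BOM the same way.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Hypothetical helper: decodes raw bytes as UTF-8 without any BOM handling.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '4', '2'};
        byte[] withoutBom = {'4', '2'};

        String a = decode(withBom);
        String b = decode(withoutBom);

        // The BOM survives decoding as the single character U+FEFF.
        System.out.println(a.length());                      // 3
        System.out.println(a.charAt(0) == '\uFEFF');         // true
        System.out.println(Character.isDigit(a.charAt(0)));  // false
        System.out.println(Character.isDigit(b.charAt(0)));  // true
    }
}
```

U+FEFF has no visible glyph, so both strings tend to look like "42" when inspected.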

So you have two options:

If you control the file, produce it again without the BOM.
If you don't, check whether the file starts with a BOM and remove it before doing anything else.
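For the second option, one lightweight variant is to leave the stream alone and drop the U+FEFF character from the decoded text instead. This is only a sketch under that assumption; the class BomStrip and the method stripLeadingBom are illustrative names, not part of any library, and the method would be applied to the first line read from the file.

```java
public class BomStrip {
    // Hypothetical helper: returns the line without a leading U+FEFF, if one is present.
    static String stripLeadingBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        String firstLine = stripLeadingBom("\uFEFF42");
        System.out.println(Character.isDigit(firstLine.charAt(0))); // true
    }
}
```

This works on characters rather than bytes, so it only helps once the file has already been decoded as UTF-8.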

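Here is some code to get you started. It skips the BOM rather than removing it from the file. Feel free to modify it as you like. It comes from a test utility I wrote a few years ago: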

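// Requires: java.io.BufferedInputStream, java.io.IOException,
//           java.io.InputStream, java.io.PushbackInputStream
private static InputStream filterBOMifExists(InputStream inputStream) throws IOException {
    PushbackInputStream pushbackInputStream = new PushbackInputStream(new BufferedInputStream(inputStream), 3);
    byte[] bom = new byte[3];
    // read() may return fewer than 3 bytes (for example, for a very short file)
    int bytesRead = pushbackInputStream.read(bom);
    if (bytesRead > 0) {
        boolean hasBom = bytesRead == 3
                && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF;
        if (!hasBom) {
            // Not a BOM: push back only the bytes that were actually read
            pushbackInputStream.unread(bom, 0, bytesRead);
        }
    }
    return pushbackInputStream;
}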
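To show how a BOM-skipping wrapper could be wired into the Scanner from the question, here is a self-contained sketch. It carries its own compact stand-in for the helper (skipBom) so that it runs on its own, uses an in-memory ByteArrayInputStream in place of the real file, and relies on InputStream.readNBytes, which assumes Java 9 or later. The names BomScannerDemo, skipBom and firstCharIsDigit are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.util.Scanner;

public class BomScannerDemo {
    // Compact stand-in for the helper: consumes a leading UTF-8 BOM if present.
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pin.readNBytes(head, 0, 3); // Java 9+; returns fewer than 3 only at end of stream
        boolean isBom = n == 3
                && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF;
        if (!isBom && n > 0) {
            pin.unread(head, 0, n); // put back exactly what was read
        }
        return pin;
    }

    // Reads the first line through a Scanner and reports whether it starts with a digit.
    static boolean firstCharIsDigit(byte[] fileBytes) {
        try (Scanner sc = new Scanner(skipBom(new ByteArrayInputStream(fileBytes)), "UTF-8")) {
            if (!sc.hasNextLine()) {
                return false;
            }
            String line = sc.nextLine();
            return !line.isEmpty() && Character.isDigit(line.charAt(0));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '4', '2', '\n'};
        System.out.println(firstCharIsDigit(withBom)); // true
    }
}
```

With a real file, the ByteArrayInputStream would be replaced by the FileInputStream from the question, and the rest of the reading loop stays unchanged.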
