How to identify a special character in a file using Java

Question

I have a .doc file that contains a header before ÐÏ, so I need to remove all the characters that come before ÐÏ.

Example: asdfasdfasdfasfasdfasfÐÏ9asjdfkj

I used the code below.

InputStream is = new FileInputStream("D:\\Users\\Vinoth\\workspace\\Testing\\Testing_2.doc");
DataInputStream dis = new DataInputStream(is);
OutputStream os = new FileOutputStream("D:\\Users\\Vinoth\\workspace\\Testing\\Testing_3.doc");
DataOutputStream dos = new DataOutputStream(os);
byte[] buff = new byte[dis.available()];
dis.readFully(buff);
char temp = 0;
boolean start = false;
try {
    for (byte b : buff) {
        char c = (char) b;
        if (temp == 'Ð' && c == 'Ï') {
            start = true;
        }
        if (start) {
            dos.write(c);
        }
        temp = c;
    }
} catch (IOException e) {
    e.printStackTrace();
}

However, when the condition is not satisfied, nothing is written to my file. Please let me know how I can do this.

Answer 1

The problem is in the line char c = (char)b;

See byte-and-char-conversion-in-java, where you will find:

A character in Java is a Unicode code-unit which is treated as an unsigned number.

Take your case as an example. The byte for the character 'Ï' has the binary representation 11001111. According to the Oracle tutorial:

byte: The byte data type is an 8-bit signed two's complement integer. It has a minimum value of -128 and a maximum value of 127 (inclusive).

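So the value of the byte is -49. To compare it with a Unicode character, however, 11001111 should be interpreted as an unsigned byte, whose value is 207. The plain cast (char)b sign-extends -49 to the code unit 0xFFCF (65487), which never equals 'Ï' (U+00CF), so your condition is never true.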

int i = b & 0xff;

gives the unsigned value of the byte's binary representation.
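To make the difference concrete, here is a small sketch that compares the two casts for the byte 0xCF. The class name SignExtensionDemo and the helper names directCast and maskedCast are made up for this illustration; they are not part of your code.

```java
public class SignExtensionDemo {
    // Direct cast: the byte is sign-extended to int first, then narrowed to char.
    static char directCast(byte b) {
        return (char) b;
    }

    // Masked cast: keep only the low 8 bits so the value stays in 0..255.
    static char maskedCast(byte b) {
        return (char) (b & 0xff);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xCF; // bit pattern 11001111, which is -49 as a signed byte
        System.out.println((int) b);                    // prints -49
        System.out.println((int) directCast(b));        // prints 65487 (0xFFCF)
        System.out.println((int) maskedCast(b));        // prints 207 (0x00CF)
        System.out.println(maskedCast(b) == '\u00CF');  // prints true ('Ï' is U+00CF)
    }
}
```

Only the masked value can ever compare equal to 'Ï', which is why the loop in the question never sets start to true.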

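You can modify your code as shown below. To make debugging easier, I changed the file paths and the file format. I am not sure whether .doc itself is a problem, but your code does have the error I described above.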

import java.io.*;

public class Test {
    public static void main(String[] args) {
        // try-with-resources closes both streams even if an exception is thrown
        try (DataInputStream dis = new DataInputStream(new FileInputStream("Testing_2.txt"));
             DataOutputStream dos = new DataOutputStream(new FileOutputStream("Testing_3.txt"))) {
            // available() is kept from the question; it is not guaranteed to equal the file size
            byte[] buff = new byte[dis.available()];
            dis.readFully(buff);
            char temp = 0;
            boolean start = false;
            for (byte b : buff) {
                int i = b & 0xff;   // unsigned value of the byte, 0..255
                char c = (char) i;
                // '\u00D0' is 'Ð' and '\u00CF' is 'Ï'; escapes avoid source-encoding issues
                if (!start && temp == '\u00D0' && c == '\u00CF') {
                    start = true;
                    dos.write(temp); // keep the 'Ð' that begins the marker
                }
                if (start) {
                    dos.write(c);
                }
                temp = c;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
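As an alternative, the search can be done on the raw bytes without going through char at all. The sketch below is only one possible approach. It assumes that ÐÏ in your file corresponds to the two bytes 0xD0 0xCF (which is how binary .doc files normally start), and that the file should be copied unchanged when the marker is missing. The class name HeaderStripper, the method names, and the relative file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class HeaderStripper {
    // Returns the index of the first 0xD0 0xCF byte pair, or -1 if there is none.
    static int indexOfMarker(byte[] data) {
        for (int i = 0; i + 1 < data.length; i++) {
            if ((data[i] & 0xff) == 0xD0 && (data[i + 1] & 0xff) == 0xCF) {
                return i;
            }
        }
        return -1;
    }

    // Drops everything before the marker. Assumption: when no marker is found,
    // the data is returned unchanged instead of producing an empty result.
    static byte[] stripHeader(byte[] data) {
        int start = indexOfMarker(data);
        return start < 0 ? data : Arrays.copyOfRange(data, start, data.length);
    }

    public static void main(String[] args) throws IOException {
        Path in = Paths.get("Testing_2.doc");   // hypothetical input path
        Path out = Paths.get("Testing_3.doc");  // hypothetical output path
        Files.write(out, stripHeader(Files.readAllBytes(in)));
    }
}
```

Files.readAllBytes reads the whole file regardless of what available() would report, and comparing masked bytes avoids any dependence on the encoding of the source file.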
