How to retrieve the encoding of an XML file to parse it correctly? (Best Practice)

Question

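My application downloads XML files that happen to be encoded in either UTF-8 or ISO-8859-1 (the software that generates those files is poor, so that is what it does). I'm from Germany, so we use umlauts (ä, ü, ö), which means the encoding of these files really does make a difference. I know that XmlPullParser has a method .getInputEncoding() that correctly detects how my files are encoded. But I already have to set the encoding on my FileInputStream before I can call .getInputEncoding(). So far I have just used a BufferedReader to read the XML file, searched for the entry that specifies the encoding, and then instantiated my PullParser.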

private void setFileEncoding() {
    try {
        bufferedReader.reset();
        String firstLine = bufferedReader.readLine();
        int start = firstLine.indexOf("encoding=") + 10; // +10 skips encoding= (9 chars) plus the opening quote

        String encoding = firstLine.substring(start, firstLine.indexOf("\"", start));

        // now set the encoding to the reader to be used for parsing afterwards
        bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream, encoding));
        bufferedReader.mark(0);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

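Is there another way to do this? Can I take advantage of the .getInputEncoding() method? Right now that method seems somewhat useless to me: why does the detected encoding matter if I already have to set it before I can check it?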

Answer 1

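If you trust the creator of the XML to have set the encoding correctly in the XML declaration, you can sniff it as you are doing. However, be aware that the declaration can be wrong.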
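
A minimal sketch of sniffing the declaration from raw bytes rather than through a Reader, so that no encoding has to be chosen up front. It assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1 (it will not work for UTF-16), and the class name XmlDeclarationSniffer and the 256-byte peek size are hypothetical choices, not part of any library:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlDeclarationSniffer {
    // Matches encoding="..." or encoding='...' inside the XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the declared charset, or UTF-8 (the XML default) if none is declared. */
    static Charset sniff(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so this decode never fails
        // and leaves the ASCII characters of the declaration intact.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        // Charset.forName throws an IllegalArgumentException subclass for unknown names.
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }

    /** Peeks at the first bytes of the stream, then rewinds it for the real parser. */
    static Charset sniff(BufferedInputStream in) throws IOException {
        in.mark(256);
        byte[] head = new byte[256];
        int n = in.readNBytes(head, 0, head.length); // requires Java 9+
        in.reset();
        return sniff(Arrays.copyOf(head, n));
    }
}
```

After the call, the same BufferedInputStream can be wrapped in an InputStreamReader with the returned charset. Note also that the XmlPull API documents setInput(InputStream, String) as accepting null for the encoding, in which case the parser is expected to detect it itself; whether that works reliably depends on the parser implementation.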

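If you want to detect the encoding directly, independently of the (possibly wrong) encoding set in the XML declaration, use a library such as ICU CharsetDetector or the older jChardet.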

ICU CharsetDetector:

CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;

detector = new CharsetDetector();

detector.setText(byteData);
match = detector.detect();

jChardet:

    // Initialize the nsDetector() ;
    int lang = (argv.length == 2)? Integer.parseInt(argv[1])
                                     : nsPSMDetector.ALL ;
    nsDetector det = new nsDetector(lang) ;

    // Set an observer...
    // The Notify() will be called when a matching charset is found.

    det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                HtmlCharsetDetector.found = true ;
                System.out.println("CHARSET = " + charset);
            }
    });

    URL url = new URL(argv[0]);
    BufferedInputStream imp = new BufferedInputStream(url.openStream());

    byte[] buf = new byte[1024] ;
    int len;
    boolean done = false ;
    boolean isAscii = true ;

    while( (len=imp.read(buf,0,buf.length)) != -1) {

            // Check if the stream is only ascii.
            if (isAscii)
                isAscii = det.isAscii(buf,len);

            // DoIt if non-ascii and not done yet.
            if (!isAscii && !done)
                done = det.DoIt(buf,len, false);
    }
    det.DataEnd();

    if (isAscii) {
       System.out.println("CHARSET = ASCII");
       found = true ;
    }
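
When the only candidates are UTF-8 and ISO-8859-1, as in the question, a simpler heuristic than either library may be enough. The following is a sketch of that swapped-in technique, not of ICU or jChardet: it tries a strict UTF-8 decode and falls back to ISO-8859-1 if the bytes are malformed. The class name Utf8OrLatin1 is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8OrLatin1 {
    /** Returns UTF-8 if the bytes decode cleanly as UTF-8, otherwise ISO-8859-1. */
    static Charset guess(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8; every byte sequence is valid ISO-8859-1.
            return StandardCharsets.ISO_8859_1;
        }
    }
}
```

This works because an ISO-8859-1 umlaut is a single byte above 0x7F, which is rarely part of a well-formed UTF-8 sequence. It can still misjudge ISO-8859-1 text that happens to form valid UTF-8, and pure ASCII input is reported as UTF-8, which is harmless.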

Answer 2

If your server sends it correctly, you may be able to get the correct character set from the Content-Type header.
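
As an illustration, here is a small sketch that extracts the charset parameter from a Content-Type header value such as "text/xml; charset=ISO-8859-1". The class name ContentTypeCharset is hypothetical, and the parsing is deliberately simple rather than a full implementation of the HTTP header grammar:

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

final class ContentTypeCharset {
    /** Returns the charset named in the header, or empty if absent or unknown. */
    static Optional<Charset> parse(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; the parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: treat as not provided.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

With a java.net.URLConnection, the header value is available from getContentType(). If the result is empty, falling back to the XML declaration or to byte-level detection is a reasonable next step.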

Tags: java, xml, encoding, android, xmlpullparser
