InputStream and UTF-8 sometimes show "?" characters

Question

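I have been dealing with this problem for over a month. I have checked almost every related solution here and on Google, but I could not find anything that actually solves it. I am trying to download the HTML source of a website, but in most cases some of the text comes back with "?" characters in it, most likely because the site is in Hebrew. Here is my code: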

    public static InputStream openHttpGetConnection(String url)
            throws Exception {
        InputStream inputStream = null;
        HttpClient httpClient = new DefaultHttpClient();
        HttpResponse httpResponse = httpClient.execute(new HttpGet(url));
        inputStream = httpResponse.getEntity().getContent();
        return inputStream;

    }
    public static String downloadSource(String url) {
        int BUFFER_SIZE = 1024;

        InputStream inputStream = null;
        try {
            inputStream = openHttpGetConnection(url);
        } catch (Exception e) {
            // TODO: handle exception
        }
        int bytesRead;
        String str = "";
        byte[] inpputBuffer = new byte[BUFFER_SIZE];
        try {
            while ((bytesRead = inputStream.read(inpputBuffer)) > 0) {
                String read = new String(inpputBuffer, 0, bytesRead,"UTF-8");
                str +=read;

            }
        } catch (Exception e) {
            // TODO: handle exception
        }
        return str;

    }

Thanks.

Answer 1

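Converting an InputStream to a String requires specifying an encoding, just as you did in new String(inpputBuffer, 0, bytesRead,"UTF-8");.

But your approach has a couple of drawbacks.

How do you know you have to use UTF-8?

When retrieving HTTP content, you generally cannot know in advance which encoding the HTTP response will use. HTTP does provide a mechanism to specify it, though: the Content-Type header.

More specifically, your response object should have a Content-Type header with a parameter called charset. In the response it should look something like:

Content-Type: text/html; charset=UTF-8

You should use whatever follows charset= to convert your bytes to chars.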
Seeing that you appear to be using Apache HttpClient, their documentation states:

You can set the content type header for a request with the addRequestHeader method in each method and retrieve the encoding for the response body with the getResponseCharSet method.
If the response is known to be a String, you can use the getResponseBodyAsString method which will automatically use the encoding specified in the Content-Type header or ISO-8859-1 if no charset is specified.
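If you end up reading the header yourself, the lookup described above could be sketched roughly as follows. This is only an illustration: the class name ContentTypeCharset, the method fromContentType, and the regular expression are my own assumptions and not part of HttpClient. The fallback charset is whatever the caller chooses (ISO-8859-1 in the usage below, matching the default mentioned in the quote).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {
    // Matches the charset parameter, quoted or unquoted, case-insensitively.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the header value, or the fallback if absent or unknown. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(fromContentType("text/html; charset=UTF-8", fallback));
        System.out.println(fromContentType("text/html", fallback));
    }
}
```

If I remember correctly, HttpClient 4.x (which DefaultHttpClient belongs to) offers EntityUtils.toString(entity), which does this lookup for you, so a hand-written parser like this is mostly useful when you are working with the raw header string.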

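Alternative approach

If there is no Content-Type header and you know your content is HTML, you can try decoding it as a String with some encoding (preferably UTF-8 or ISO Latin), look for something matching <meta charset="UTF-8">, and use that as the charset. This should only be a fallback.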
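A minimal sketch of that fallback, assuming the raw response bytes are already in memory. The class name MetaCharsetSniffer, the sniff method, the regular expression, and the 4096-byte prefix limit are all my own choices for illustration; real pages can declare their charset in ways this simple pattern will miss.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class MetaCharsetSniffer {
    // Covers <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    static Charset sniff(byte[] rawHtml, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so this pre-decode
        // cannot fail and keeps the ASCII markup readable.
        int prefixLength = Math.min(rawHtml.length, 4096); // arbitrary limit, an assumption
        String head = new String(rawHtml, 0, prefixLength, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported name: fall through to the fallback.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"UTF-8\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(html, StandardCharsets.ISO_8859_1));
    }
}
```

Once the charset is known, the whole byte array is decoded again with it; the ISO-8859-1 pass is only used to find the declaration.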

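Not every byte sequence can be converted to a String

The second drawback is that you read an arbitrary number of bytes from the stream and try to convert them to a String, which may not be possible.

Indeed, UTF-8 encodes some characters across several bytes. For example, "é" is encoded as the two bytes 0xC3 0xA9. Suppose the response consists of two "é" characters. If your first call to read returns:

[c3, a9, c3]

then your conversion with new String(bytes, offset, length, charsetName) cannot decode the last byte, because on its own it is not a complete UTF-8 sequence. Java substitutes the replacement character U+FFFD for it, which is typically displayed as "?".

Your next read will fetch the remaining content:

[a9]

which (whatever it decodes to) is not an "é" character.

Bottom line: with your pattern you cannot reliably convert even a valid UTF-8 byte sequence to a String.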
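To make the failure concrete, here is a small self-contained reproduction of the two-"é" example. The class and method names are made up for the demonstration; decodePerChunk simply mimics what the loop in the question does with each buffer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, only meant to reproduce the example above.
final class SplitSequenceDemo {
    /** Decodes each chunk on its own, the way the loop in the question does. */
    static String decodePerChunk(byte[][] chunks) {
        StringBuilder sb = new StringBuilder();
        for (byte[] chunk : chunks) {
            sb.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two "é" characters: C3 A9 C3 A9.
        byte[] whole = {(byte) 0xC3, (byte) 0xA9, (byte) 0xC3, (byte) 0xA9};
        // The same bytes, split in the middle of the second character.
        byte[][] chunks = {
            {(byte) 0xC3, (byte) 0xA9, (byte) 0xC3},
            {(byte) 0xA9}
        };

        String decodedWhole = new String(whole, StandardCharsets.UTF_8);
        String decodedInChunks = decodePerChunk(chunks);

        // Decoding everything at once gives the two expected characters.
        System.out.println(decodedWhole.equals("\u00E9\u00E9"));
        // Decoding per chunk turns each half of the split character into U+FFFD.
        System.out.println(decodedInChunks.equals("\u00E9\uFFFD\uFFFD"));
    }
}
```

Whether the problem shows up in practice depends on where the buffer boundaries happen to fall, which would explain why the "?" characters only appear some of the time.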

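Going forward: you are using HttpClient, so use its method for converting the HTTP response to a String. If you want to do it yourself, the simple way is to copy all of the input into a byte array and then convert that array in one go. Something like this (responseInputStream and encoding are assumed to be defined already):

    ByteArrayOutputStream responseContent = new ByteArrayOutputStream();
    byte[] chunk = new byte[1024];
    int n;
    while ((n = responseInputStream.read(chunk)) != -1) {
        responseContent.write(chunk, 0, n); // collect raw bytes, no decoding yet
    }
    byte[] rawResponse = responseContent.toByteArray();
    String stringResponse = new String(rawResponse, encoding); // decode once, at the end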

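But if you want a fully streaming implementation (one that does not buffer the whole response in memory), you can also use a CharsetDecoder or, as in @jas's answer, wrap your InputStream in a Reader and concatenate the output (preferably into a StringBuilder, which should be faster when many concatenations take place).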
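For the CharsetDecoder route, a rough sketch might look like the following. The class name StreamingDecoder, the decode method, and the minimum chunk size of 8 are my own assumptions. The important part is that bytes left over from an incomplete character are kept in the buffer (via compact) and completed by the next read, instead of being decoded too early.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class StreamingDecoder {
    static String decode(InputStream in, Charset charset, int chunkSize) throws IOException {
        if (chunkSize < 8) {
            // Assumption: leaves room for an incomplete multi-byte sequence plus new input.
            throw new IllegalArgumentException("chunkSize must be at least 8");
        }
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bytes = ByteBuffer.allocate(chunkSize);
        CharBuffer chars = CharBuffer.allocate(chunkSize);
        StringBuilder out = new StringBuilder();
        boolean eof = false;
        while (!eof) {
            // Read after any bytes carried over from the previous round.
            int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
            if (n == -1) {
                eof = true;
            } else {
                bytes.position(bytes.position() + n);
            }
            bytes.flip();
            CoderResult result;
            do {
                result = decoder.decode(bytes, chars, eof);
                drain(chars, out);
            } while (result.isOverflow());
            // Keep the undecoded tail (an incomplete character) for the next round.
            bytes.compact();
        }
        while (decoder.flush(chars).isOverflow()) {
            drain(chars, out);
        }
        drain(chars, out);
        return out.toString();
    }

    private static void drain(CharBuffer chars, StringBuilder out) {
        chars.flip();
        out.append(chars);
        chars.clear();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "a\u00E9\u00E9\u00E9\u00E9\u00E9".getBytes(StandardCharsets.UTF_8);
        String text = decode(new ByteArrayInputStream(data), StandardCharsets.UTF_8, 8);
        System.out.println(text.length());
    }
}
```

In most cases wrapping the stream in an InputStreamReader is simpler and does the same bookkeeping internally; the explicit decoder is mainly worth it if you need control over how malformed input is handled.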

Answer 2

To read characters from a byte stream with a given encoding, use a Reader. In your case it would look something like this:

    InputStreamReader isr = new InputStreamReader(inputStream, "UTF-8");
    char[] inputBuffer = new char[BUFFER_SIZE];
    StringBuilder sb = new StringBuilder();
    int charsRead;

    // read() returns -1 at end of stream
    while ((charsRead = isr.read(inputBuffer, 0, BUFFER_SIZE)) != -1) {
        sb.append(inputBuffer, 0, charsRead);
    }
    str = sb.toString();

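You can see that the bytes are read in directly as characters. It is the Reader's job to know whether it needs to read one byte or two, for example, to produce a character in the buffer. This is basically your approach, but decoding as the bytes are read in rather than afterwards.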
