Getting content from URL returns strange characters

Question

I am using this method to get content from a URL:

public String getContentFromURL(String stringUrl) throws UnsupportedEncodingException{
    String content = "";
    try {
        URL url = new URL(stringUrl);
        URLConnection urlc = url.openConnection();
        StringBuilder builder;
        try (BufferedReader buffer = new BufferedReader(new InputStreamReader(urlc.getInputStream(), "UTF-8"))) {
            builder = new StringBuilder();
            int byteRead;
            while ((byteRead = buffer.read()) != -1)
                builder.append((char) byteRead);
        }
        content=builder.toString();
        return content;
    } catch (MalformedURLException ex) {
        Logger.getLogger(Utils.class.getName()).log(Level.SEVERE, null, ex);
    } catch (IOException ex) {
        Logger.getLogger(Utils.class.getName()).log(Level.SEVERE, null, ex);
    }
    return content;
}

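It works fine for most of the files I fetch, except for those containing characters from other languages, such as á, í, etc. Instead of those characters I get �.

I tried configuring the Tomcat connector like this: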

       <Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"
       connectionTimeout="20000"
       redirectPort="8443" />

I set the page encoding: <%@page contentType="text/html" pageEncoding="UTF-8"%>

I also added this in the servlet:

response.setContentType("text/html;charset=UTF-8");
response.setCharacterEncoding("UTF-8");
request.setCharacterEncoding("UTF-8");

I also tried decoding the content as GZIP.

None of the above options worked for me.

This is the URL I am trying to get content from:

https://www.dropbox.com/s/kpbrx26bwhoa1rp/moment.js?raw=1

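It is a file in Dropbox that even the browser reads correctly; raw=1 returns the file's content directly. In the browser, search for "[Môre om]" to check that it displays correctly.

What is the correct way to get content containing such characters from a URL?

PS: Using Notepad++ I confirmed that the file's encoding is UTF-8.

PS2: Getting the character encoding from the connection returns null.

Update: I tried this code using the Google Guava library: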

        String content = "";
        URLConnection url = new URL("https://www.dropbox.com/s/kpbrx26bwhoa1rp/moment.js?raw=1").openConnection();

        InputStream stream = url.getInputStream();
        content = CharStreams.toString(new InputStreamReader(stream, Charsets.UTF_8));
        Closeables.closeQuietly(stream);

        try (PrintStream outText = new PrintStream(new FileOutputStream("C:\\Users\\myUser\\Desktop\\test.txt"))) {
            outText.print(content);
            outText.close();
        }

It does work in a plain Java project, where all characters display correctly, but not in a Java Web App project. This is the index page where I tried it:

<%@page import="java.io.PrintStream"%>
<%@page import="java.io.FileOutputStream"%>
<%@page import="com.google.common.io.Closeables"%>
<%@page import="java.io.InputStreamReader"%>
<%@page import="com.google.common.io.CharStreams"%>
<%@page import="com.google.common.base.Charsets"%>
<%@page import="java.io.InputStream"%>
<%@page import="java.net.URLConnection"%>
<%@page import="java.net.URL"%>
<%@page contentType="text/html" pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>JSP Page</title>
</head>
<body>
    <%
        response.setContentType("text/html;charset=UTF-8");
        response.setCharacterEncoding("UTF-8");
        request.setCharacterEncoding("UTF-8");

        String content = "";
        URLConnection url = new URL("https://www.dropbox.com/s/kpbrx26bwhoa1rp/moment.js?raw=1").openConnection();

        InputStream stream = url.getInputStream();
        content = CharStreams.toString(new InputStreamReader(stream, Charsets.UTF_8));
        Closeables.closeQuietly(stream);

        try (PrintStream outText = new PrintStream(new FileOutputStream("C:\\Users\\myUser\\Desktop\\test.txt"))) {
            outText.print(content);
            outText.close();
        }
    %>
</body>
</html>

When I look at the created file, those � characters still appear. Why does the same code behave differently between a standalone application and a web application?

Solved: I replaced

        try (PrintStream outText = new PrintStream(new FileOutputStream("C:\\Users\\myUser\\Desktop\\test.txt"))) {
            outText.print(content);
            outText.close();
        }

with

        Writer outText = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:\\Users\\myUser\\Desktop\\testRaw.txt"), "UTF-8"));
        try {
            outText.write(content);
        } finally {
            outText.close();
        }

Answer 1

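I turned your code into this minimal example, removing the odd bits along the way (the point of a BufferedReader is to avoid reading character by character). I get perfectly good UTF-8. Try running this, redirect the output to a file, and check the result with a Unicode-aware text editor.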

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class UTF8Test {

    public static void main(String[] args) throws Exception {
        //System.out.println(getContentFromURL("http://www.columbia.edu/~kermit/utf8.html"));
        System.out.println(getContentFromURL("https://www.dropbox.com/s/kpbrx26bwhoa1rp/moment.js?raw=1"));
    }

    public static String getContentFromURL(String stringUrl) throws Exception {
        URL url = new URL(stringUrl);
        URLConnection urlc = url.openConnection();
        StringBuilder builder = new StringBuilder();
        // Decode the response bytes explicitly as UTF-8 and close the reader when done.
        try (BufferedReader breader = new BufferedReader(
                new InputStreamReader(urlc.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = breader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }
}
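
As a side note on the BufferedReader remark: readLine() drops line terminators, so if the original line endings matter, one option is to copy the text through a char buffer instead. This is only a sketch; the class name ReaderUtil and the method readAll are made up for illustration and are not part of the code above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReaderUtil {

    // Hypothetical helper: drains a Reader into a String in chunks,
    // keeping every character, including the original line terminators.
    static String readAll(Reader reader) throws IOException {
        StringBuilder builder = new StringBuilder();
        char[] chunk = new char[8192];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            builder.append(chunk, 0, n);
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the InputStreamReader on the URL connection.
        String text = readAll(new StringReader("first line\r\nM\u00F4re om\n"));
        System.out.println(text.contains("\r\n")); // true: the line endings survive
    }
}
```

With a real connection, the Reader would be the same InputStreamReader(urlc.getInputStream(), "UTF-8") used above.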

Answer 2

You are writing the text using the default encoding; it is better to store it as UTF-8.

try (PrintStream outText = new PrintStream(
        new File("C:\\Users\\myUser\\Desktop\\test.txt"), "UTF-8")) {
    if (!content.startsWith("\uFEFF")) {
        outText.print("\uFEFF");
    }
    outText.print(content);
} // Calls outText.close()

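This writes the text with a BOM character '\uFEFF' at the start. That is an invisible zero-width no-break space that Windows can use to detect UTF-8. It is actually bad practice, but it allows the text to be edited in Notepad.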
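
To make the BOM concrete, here is a small sketch showing what '\uFEFF' looks like once encoded as UTF-8, and that Java does not strip it when decoding. The class name BomDemo and the helper hex are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {

    // Formats bytes as space-separated uppercase hex, e.g. "EF BB BF".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FEFF encoded as UTF-8 is the three-byte signature EF BB BF.
        System.out.println(hex("\uFEFF".getBytes(StandardCharsets.UTF_8)));

        // Java's UTF-8 decoder keeps the BOM as an ordinary first character.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        String decoded = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(decoded.length()); // 2: the BOM char plus 'a'
    }
}
```

That second point is why the snippet above checks content.startsWith("\uFEFF") before adding another one.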

The error is that some Unicode characters cannot be mapped to the default encoding.
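
A rough way to see both failure modes without touching the file system: encode the sample string with one charset and decode it with another. This is a sketch under the assumption that the web app's JVM has a different default charset than the standalone run; the class DefaultCharsetDemo and the method roundTrip are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {

    // Encodes text with one charset and decodes the bytes with another,
    // which mimics writing a file in one encoding and opening it in another.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        return new String(text.getBytes(writeAs), readAs);
    }

    public static void main(String[] args) {
        String text = "M\u00F4re om"; // "Môre om", the sample string from the question

        // This is what PrintStream(OutputStream) falls back to; it can differ per JVM.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Unmappable character: US-ASCII has no 'ô', so the encoder substitutes '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII));

        // Mismatch: Latin-1 bytes read back as UTF-8 contain U+FFFD, shown as "�".
        String mismatched = roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(mismatched.contains("\uFFFD"));

        // Same charset on both sides keeps the text intact.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text));
    }
}
```

Passing the charset explicitly on both the reading and the writing side removes the dependency on the JVM default.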

Aside: you are assuming that the text at the URL is in UTF-8. In general, it is better to check this via the URLConnection headers.

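// The charset is a parameter of the Content-Type header; getContentEncoding()
// returns Content-Encoding (e.g. "gzip"), not the charset.
// Requires java.util.regex.Matcher and java.util.regex.Pattern.
String contentType = urlc.getContentType(); // e.g. "text/html; charset=ISO-8859-1"
Matcher m = Pattern.compile("(?i)charset=\"?([^\";\\s]+)")
        .matcher(contentType == null ? "" : contentType);
String encoding = m.find() ? m.group(1) : "UTF-8"; // fall back to UTF-8
if (encoding.equalsIgnoreCase("ISO-8859-1")) { // Latin-1
    encoding = "Windows-1252"; // Windows Latin-1
}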

The Latin-1 patch can be useful because all browsers on every operating system interpret ISO-8859-1 as Windows-1252; this is now official in HTML5.
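
For anyone wondering where the two encodings actually differ, here is a small sketch decoding single bytes both ways. It assumes the runtime provides the windows-1252 charset (typical for a full JDK); the class name Latin1Demo and the helper decode are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Demo {

    // Decodes a single byte value with the given charset.
    static String decode(int b, Charset charset) {
        return new String(new byte[] {(byte) b}, charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // 0x80 is a C1 control character in ISO-8859-1 ...
        System.out.println((int) decode(0x80, StandardCharsets.ISO_8859_1).charAt(0)); // 128
        // ... but the euro sign in Windows-1252.
        System.out.println(decode(0x80, cp1252).equals("\u20AC")); // true

        // Outside 0x80..0x9F the two agree, e.g. 0xF4 is 'ô' in both.
        System.out.println(decode(0xF4, cp1252).equals(decode(0xF4, StandardCharsets.ISO_8859_1))); // true
    }
}
```

So decoding a page labelled ISO-8859-1 as Windows-1252 only changes bytes that would otherwise turn into control characters.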


java

jsp

utf-8

character-encoding

servlet-3.0