Keep Unicode characters in a Java string

Question

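I am writing a crawler in Java to scrape some websites, which may contain Unicode characters such as "£". When I store the content (the source HTML) in a Java String, these characters are lost and replaced by question marks ("?"). I would like to know how to keep them intact. The relevant code is below: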

protected String readWebPage(String weburl) throws IOException{
        HttpClient httpclient = new DefaultHttpClient();

        HttpGet httpget = new HttpGet(weburl); 
        ResponseHandler<String> responseHandler = new BasicResponseHandler();    
        String responseBody = httpclient.execute(httpget, responseHandler);
        // responseBody now contains the contents of the page
        httpclient.getConnectionManager().shutdown();
        return responseBody;
    }

   // function call
   String res = readWebPage(url);
   PrintWriter out = new PrintWriter(outDir+name+".html");
   out.println(res);
   out.close();

Later, when doing character matching, I would also like to be able to do something like this:

if(text.indexOf("£")>=0)

I don't know whether Java will recognize that character and do what I want.

Any input would be appreciated. Thanks in advance.

Answer 1

Use the following code:

FileOutputStream fileStream = new FileOutputStream(outDir+name+".html");
OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileStream, StandardCharsets.UTF_8);
PrintWriter out = new PrintWriter(outputStreamWriter);

From the Charset documentation:

A character-encoding scheme is a mapping between one or more coded character sets and a set of octet (eight-bit byte) sequences. UTF-8, UTF-16, ISO 2022, and EUC are examples of character-encoding schemes. Encoding schemes are often associated with a particular coded character set; UTF-8, for example, is used only to encode Unicode. Some schemes, however, are associated with multiple coded character sets; EUC, for example, can be used to encode characters in a variety of Asian coded character sets.
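To make the quoted definition concrete, here is a small sketch of how the single character "£" (U+00A3) maps to different byte sequences under different encoding schemes. The class name `PoundEncodingDemo` and the `hex` helper are made up for illustration; only `String.getBytes(Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
final class PoundEncodingDemo {
    // U+00A3 POUND SIGN, written as an escape so the example does not
    // depend on the encoding of this source file.
    static final String POUND = "\u00A3";

    // Formats a byte array as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF-8 needs two bytes for this character.
        System.out.println(hex(POUND.getBytes(StandardCharsets.UTF_8)));      // C2A3
        // ISO-8859-1 has it as a single byte.
        System.out.println(hex(POUND.getBytes(StandardCharsets.ISO_8859_1))); // A3
        // US-ASCII cannot represent it, so getBytes substitutes '?' (0x3F).
        System.out.println(hex(POUND.getBytes(StandardCharsets.US_ASCII)));   // 3F
    }
}
```

The last line shows where the question marks in the question are likely to come from: encoding with a charset that has no mapping for the character silently replaces it.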

Answer 2

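There are two steps. The loaded string (in Java always Unicode) has to be saved as UTF-8. But the browser also needs to know the encoding, and for a file on the file system the only place it can get that from is the HTML meta tag. So you need to make sure the page contains something like: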

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

2. Write the HTML as UTF-8:

PrintWriter out = new PrintWriter(outDir+name+".html", "UTF-8");

1. Before writing, patch the original page's HTML charset declaration to UTF-8:

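// Unquoted form first (Content-Type meta), then the quoted HTML5 form; $1 keeps the opening quote.
String res2 = res.replaceFirst("charset=([-\\w]+)", "charset=UTF-8")
         .replaceFirst("charset=([\"'])([-\\w]+)", "charset=$1UTF-8");
if (res2 == res) { // No charset given; relies on replaceFirst returning the same object when nothing matched
      res2 = res.replaceFirst("(?i)</head>",
              "<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />$0");
}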
}
res = res2;

This covers HTML meta tags with either Content-Type or the (HTML5) charset attribute.
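The patching step can also be sketched as a self-contained helper that uses an explicit `Matcher.find()` instead of comparing string references. This is only an illustration under assumptions: `CharsetPatcher` and `forceUtf8Declaration` are hypothetical names, and a regex like this can also hit the text "charset=" in the page body, so it is not a substitute for a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class CharsetPatcher {
    // Matches charset=xxx, charset="xxx" or charset='xxx', case-insensitively.
    // Group 1 captures the optional opening quote so it can be preserved.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)charset\\s*=\\s*([\"']?)[-\\w]+");
    private static final Pattern HEAD_CLOSE = Pattern.compile("(?i)</head>");

    static String forceUtf8Declaration(String html) {
        Matcher m = CHARSET.matcher(html);
        if (m.find()) {
            // Rewrite only the first declaration found.
            return m.replaceFirst("charset=$1UTF-8");
        }
        // No declaration: insert one just before </head>; $0 re-emits the tag.
        // If there is no </head> either, the input is returned unchanged.
        return HEAD_CLOSE.matcher(html).replaceFirst("<meta charset='UTF-8'>$0");
    }

    public static void main(String[] args) {
        System.out.println(forceUtf8Declaration("<meta charset=\"iso-8859-1\">"));
        System.out.println(forceUtf8Declaration("<head><title>x</title></head>"));
    }
}
```

Calling `forceUtf8Declaration(res)` before writing the file would take the place of the `res2` logic above, assuming the same `res` variable.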

Answer 3

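Your non-ASCII characters are being lost either on the way into Java or on the way out.

Java uses Unicode strings internally, so you have to tell it how to decode the input and how to encode the output.

Let's assume that HttpClient interprets the response from the remote server correctly and decodes it correctly.

Next, you have to make sure the content is encoded correctly when you write it to disk. Java uses local environment settings to guess which encoding to use, and that guess may not be suitable. To force the encoding, pass the encoding name to PrintWriter: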

PrintWriter out = new PrintWriter(outDir+name+".html", "UTF-8");

Then inspect your output.html with a text editor (such as Notepad++) running in UTF-8 mode to make sure you can still see the non-ASCII characters.
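The same check can be done in code rather than by eye. The sketch below writes a string with an explicit UTF-8 `PrintWriter`, reads the raw bytes back, and decodes them as UTF-8. The class and method names (`Utf8RoundTrip`, `writeAndReadBack`) and the use of a temporary file are assumptions made for the example.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original answer.
final class Utf8RoundTrip {
    // Writes content to a temporary file as UTF-8 and returns what is read back.
    static String writeAndReadBack(String content) {
        try {
            Path file = Files.createTempFile("roundtrip", ".html");
            try {
                // Same constructor as in the answer: file plus charset name.
                try (PrintWriter out = new PrintWriter(file.toFile(), "UTF-8")) {
                    out.print(content);
                }
                // Decode the bytes on disk with the same charset.
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "Price: \u00A35";
        // True when the pound sign survived the trip to disk and back.
        System.out.println(original.equals(writeAndReadBack(original)));
    }
}
```

If the comparison fails when the writer is created without a charset argument, the loss is happening on output rather than on input.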

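If you can't, you need to turn your attention to the input side, i.e. HttpClient. If your remote server is lying about the character encoding, see this answer for clues.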
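One way to take control of the input side is to obtain the raw response bytes and decode them yourself. The sketch below uses only the JDK and assumes you already have the body as a `byte[]` and the `Content-Type` header value as a string; `BodyDecoder`, `charsetFrom`, and `decode` are hypothetical names, and falling back to UTF-8 is an assumption, not something HttpClient does for you.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class BodyDecoder {
    // Extracts the charset parameter from a Content-Type value, if present.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i)charset\\s*=\\s*[\"']?([-\\w.:+]+)");

    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET_PARAM.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                }
            }
        }
        return fallback;
    }

    // Decodes the body with the declared charset, or UTF-8 if none is usable.
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetFrom(contentType, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] utf8Pound = {(byte) 0xC2, (byte) 0xA3};
        // True when the two UTF-8 bytes decode to the single pound sign.
        System.out.println("\u00A3".equals(decode(utf8Pound, "text/html; charset=UTF-8")));
    }
}
```

When a server declares the wrong charset, the second argument to `charsetFrom` can be replaced by whatever encoding the page is actually known to use.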

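To answer your sub-question: you can use non-ASCII characters such as "£" in your source code, provided you tell Java which character encoding the source file is in. This is a parameter to javac, but since you are probably using an IDE, you can simply set the file's character encoding in its properties and the IDE will do the rest. The most portable choice is to set the character encoding in the IDE to "UTF-8". Eclipse lets you set the character encoding for the whole project or for individual files.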
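If you would rather not depend on the source file's encoding at all, a Unicode escape works regardless of how the file is saved. A minimal sketch, with `PoundMatcher` and `containsPound` as made-up names:

```java
// Hypothetical helper, not part of the original answer.
final class PoundMatcher {
    // U+00A3 POUND SIGN as an escape: the compiler sees the same character
    // whatever encoding the source file is stored in.
    static final char POUND = '\u00A3';

    static boolean containsPound(String text) {
        return text.indexOf(POUND) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsPound("Price: \u00A310")); // true
        System.out.println(containsPound("Price: ?10"));      // false
    }
}
```

This is equivalent to the `text.indexOf("£")>=0` check in the question, as long as the text itself was decoded correctly in the first place.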

Tags: java, string, unicode, character, utf-8