Wrong UTF-8 encoding in the built project (the debug run works correctly)

Question

I have some problems with my output encoding. Here is one case:

"<" + this.strName + ">" + strData + "</" + this.strName + ">"
return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new ByteArrayInputStream(returnFullTagData(strData).getBytes())).getDocumentElement();

Debugging in NetBeans works fine, but when I run the built version it throws "Invalid byte 2 of 3-byte UTF-8 sequence".

I worked around it with:

new String( ("<" + this.strName + ">" + strData + "</" + this.strName + ">").getBytes(), "UTF-8");

But I need to change it so that it always works like the first option. Why? Because:

When I try to save a new XML file, it is saved correctly when debugging in NetBeans:

<kind schema="">Fonología</kind>

However, the built version has the same encoding problem:

<kind schema="">Fonolog?a</kind>

I think the two problems are directly related, but I don't know how.

Of course, as in the first case, I tried to fix this by changing the encoding of the input data in the XML, but that did not help.

Edit

OK, I have now applied some of your answers and found something quite interesting.

The first case, changed to:

strData = "<" + this.strName + ">" + strData2 + "</" + this.strName + ">";
return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(returnFullTagData(strData))))
                .getDocumentElement();

It works fine, and that is all it took??? (No more UnsupportedEncodingException to handle, love it.)

The second change is in the way the base XML file is read:

DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();

        FileInputStream in = new FileInputStream(new File(strBase));
        doc = dBuilder.parse(in, "UTF-8");

But now I have another problem:

<li>ArtÃculo Definido</li>

instead of

<li>Artículo Definido</li>

This is a bit tricky, because I use two kinds of nodes in this document: the "String based" nodes print correctly, but the "file based" nodes have this problem...

The libraries I use are POI, Guava, XMLBeans (included with POI), and dom4j.

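PS: Again, it only happens in the built version... Why does it happen? I'm really tired of trying to debug when debugging is basically useless here.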

Answer 1

An í replaced by ? means that text was converted from Unicode (Java text, String) to bytes using an encoding that cannot represent that character.
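To make that concrete, here is a minimal sketch of such a lossy conversion. The class and method names are made up for illustration, US-ASCII merely stands in for "an encoding that cannot map í", and the text is written with a Unicode escape so the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnmappableCharDemo {
    // Encodes the text to bytes and decodes it again, both with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "Fonolog\u00EDa"; // "Fonología"

        // US-ASCII has no byte for í, so getBytes substitutes '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // prints Fonolog?a

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints true
    }
}
```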

Use String.getBytes(StandardCharsets.UTF_8). (Unless the <?xml ...> declaration specifies an encoding other than UTF-8.)
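Applied to the parsing code from the question, this could look roughly as follows. It is only a sketch: parseTag is a hypothetical stand-in for the asker's method, and it assumes the generated XML carries no <?xml ...> declaration, in which case the parser treats the bytes as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

class Utf8BytesParseDemo {
    // Hypothetical helper: wraps the data in a tag and parses it from explicit UTF-8 bytes.
    static Element parseTag(String name, String data) throws Exception {
        String xml = "<" + name + ">" + data + "</" + name + ">";
        // Explicit charset: the result no longer depends on the platform default.
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(utf8))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element kind = parseTag("kind", "Fonolog\u00EDa");
        System.out.println(kind.getTextContent().equals("Fonolog\u00EDa")); // prints true
    }
}
```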

Avoid s = new String(s.getBytes(), "UTF-8"); it is a workaround, but it still has pitfalls.
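One such pitfall can be sketched as follows. Since the platform default cannot be changed from inside a test, the hypothetical workaround method takes the "platform default" as a parameter; ISO-8859-1 is only an assumed example of a non-UTF-8 default. In ISO-8859-1 the character í is the single byte 0xED, which in UTF-8 announces a 3-byte sequence, so the decoder rejects it. That would at least be consistent with the "Invalid byte 2 of 3-byte UTF-8 sequence" error in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WorkaroundPitfallDemo {
    // Mimics new String(s.getBytes(), "UTF-8") on a JVM whose default charset is platformDefault.
    static String workaround(String s, Charset platformDefault) {
        byte[] bytes = s.getBytes(platformDefault);       // what s.getBytes() would produce
        return new String(bytes, StandardCharsets.UTF_8); // decoded with a possibly different charset
    }

    public static void main(String[] args) {
        String text = "Fonolog\u00EDa"; // "Fonología"

        // Default is UTF-8: encode and decode agree, the text survives.
        System.out.println(workaround(text, StandardCharsets.UTF_8).equals(text)); // prints true

        // Default is ISO-8859-1: the lone byte 0xED is malformed UTF-8
        // and is decoded as the replacement character U+FFFD.
        System.out.println(workaround(text, StandardCharsets.ISO_8859_1).contains("\uFFFD")); // prints true
    }
}
```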

For good order:

NetBeans IDE, project properties / Encoding: UTF-8
Maven pom.xml: <properties> <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

After a brief evaluation of the project

I found nothing suspicious. Try:

public static void printDocument(Document doc, OutputStream out) throws IOException, TransformerException {
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer();
    //transformer.setOutputProperty("omit-xml-declaration", "no");
    transformer.setOutputProperty("method", "xml");
    transformer.setOutputProperty("indent", "yes");
    //transformer.setOutputProperty("encoding", "UTF-8");
    //transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

    //transformer.transform(new DOMSource(doc), new StreamResult(new OutputStreamWriter(out, "UTF-8")));
    transformer.transform(new DOMSource(doc), new StreamResult(out));
}

Answer 2

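When you call getBytes() on a String, you get the bytes in the underlying platform's default encoding. When you use the String(byte[]) constructor, the bytes are converted to a String using the platform's default encoding.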
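A small sketch to check this on your own machine is shown below (the class and method names are illustrative only). Running it once from the IDE and once against the built jar shows whether the two JVMs start with different defaults. Whether that is what happened here is an assumption, but it would explain a difference between debug and build.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetDemo {
    // True if the no-argument getBytes() yields the same bytes as an explicit
    // call with the JVM's default charset.
    static boolean noArgUsesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // The output differs per machine and per launch configuration.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println(noArgUsesDefault("Fonolog\u00EDa")); // prints true
    }
}
```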

When you combine the two, as in

return new String(("<" + this.strName + ">" + strData + "</" + this.strName + ">").getBytes());

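you are, at best, performing an obsolete conversion from String to bytes and back to String (that is, if the platform's default encoding can handle all the characters), and you are destroying information if it cannot. So don't be surprised to see ? in place of those characters.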

There is a simple solution: just remove that obsolete conversion:

return "<" + this.strName + ">" + strData + "</" + this.strName + ">";

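Of course, now that these characters are no longer destroyed, they may cause problems in other places where you use the platform's default encoding but UTF-8 is required. You can search for every conversion between String and byte[] and make sure they all use the same encoding, preferably UTF-8, but you can also decide to remove these unnecessary conversions.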
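As an illustration of what happens when the two sides of such a conversion disagree, the following sketch writes text to a temporary file as UTF-8 and reads it back with a charset chosen by the caller. The helper name and the temporary file are assumptions made for the example. Reading UTF-8 bytes as ISO-8859-1 turns í into Ã followed by an invisible soft hyphen (U+00AD), which looks like the ArtÃculo output reported in the question.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ConsistentCharsetDemo {
    // Writes the text to a temporary file as UTF-8, then reads it back with readCharset.
    static String writeUtf8ThenRead(String text, Charset readCharset) throws IOException {
        Path tmp = Files.createTempFile("encoding-demo", ".xml");
        try {
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(tmp), readCharset);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<li>Art\u00EDculo Definido</li>";

        // Same charset on both sides: the text is unchanged.
        System.out.println(writeUtf8ThenRead(xml, StandardCharsets.UTF_8).equals(xml)); // prints true

        // Mismatch: the two UTF-8 bytes of í (0xC3 0xAD) become two separate characters.
        String garbled = writeUtf8ThenRead(xml, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("<li>Art\u00C3\u00ADculo Definido</li>")); // prints true
    }
}
```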

If the source is a String of characters, handle it as such:

return DocumentBuilderFactory.newInstance().newDocumentBuilder()
    .parse(new InputSource(new StringReader(returnFullTagData(strData))))
    .getDocumentElement();

No conversion, no data loss...

Answer 3

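OK, thanks for all your help. It did help to solve some issues, though not the main one, and any improvement is much appreciated. The problem was the Guava library, but I don't know why. I simply went back to my first version and removed the library; the Release project started working just like Debug mode. If anyone can explain why this happens, I would be even more grateful.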

java

encoding

utf-8

guava