Java Transformer: w3c.dom.Document to InputStream

Question

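Here is my scenario:

I have an HTML file that I load into a w3c.dom.Document. After loading it as a document, I walk its nodes and change some of their values. Now I need to convert this document to a String, or preferably straight to an InputStream.

I managed to do that, but the resulting HTML has to keep some properties of the original file. For example (and this is the one I have been struggling to solve), all tags must be closed.

Say I have a link tag in the header, <link .... />, and I need the slash (/) at the end. However, after the transformer converts my document to an output stream (which I then feed into an input stream), every "/" before ">" is gone. All my tags that ended in /> have been changed to a plain >.

The reason I need this structure is that one of the libraries I am using (and I'm afraid I can't go looking for another one, especially not now) requires all tags to be closed; otherwise it throws exceptions all over the place and my program crashes....

Does anyone have a good idea or solution for me? This is my first contact with the Transformer class, so I may be missing something that could help me.

Thank you all very much,

Warm regards

Some code to explain the scenario: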

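import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;

DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
org.w3c.dom.Document doc = docBuilder.parse(his); // his = the HTML inputStream

XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//*[@id='pessoaNome']";
org.w3c.dom.Element pessoaNome = null;

try 
{
    // Look up the element whose id attribute is "pessoaNome"
    pessoaNome = (org.w3c.dom.Element) (Node) xPath.compile(expression).evaluate(doc, XPathConstants.NODE);
} 
catch (Exception e) 
{
    e.printStackTrace();
}

pessoaNome.setTextContent("The new values for the node");
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(doc);
Result outputTarget = new StreamResult(outputStream);

Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "HTML");
transformer.transform(xmlSource, outputTarget);
// At this point outputStream is already all messed up, not just the '/',
// but the missing '/' is the only thing causing me problems
InputStream is = new ByteArrayInputStream(outputStream.toByteArray());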
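A likely explanation for the missing slashes: under the XSLT 1.0 output rules, when no output method is set and the root element is named html with no namespace, the serializer defaults to the "html" method, which writes empty elements such as link without the trailing "/". The following is a minimal sketch, not a definitive fix, that requests the "xml" method explicitly. The class name XmlMethodSketch, its helper methods, and the sample markup are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical demo class; the names here are not from the original post.
class XmlMethodSketch {

    // Parses well-formed markup into a DOM Document (stand-in for the asker's loading step).
    static Document parse(String markup) {
        try {
            byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Serializes the DOM with the "xml" output method so empty elements are written as <tag/>.
    static byte[] toXmlBytes(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Without this line, a root element named "html" selects the "html" method.
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            // Optional: leave out the <?xml ...?> declaration at the top of the output.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Wraps the serialized bytes in an InputStream, as the question asks for.
    static InputStream toInputStream(Document doc) {
        return new ByteArrayInputStream(toXmlBytes(doc));
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><link href=\"a.css\"/></head>"
                + "<body><p id=\"pessoaNome\">old</p></body></html>");
        System.out.println(new String(toXmlBytes(doc), StandardCharsets.UTF_8));
    }
}
```

With the "xml" method, an element that has no children is written in the self-closed form, which is what a strict XML-style consumer expects. Whether the rest of the output suits the downstream library is something to verify against real input.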

As @Lee pointed out, I switched to Jsoup. The code became much cleaner, and I only had to set the outputSettings to make it work like a charm. Code below:

org.jsoup.nodes.Document doc = Jsoup.parse(new File(HTML), "UTF-8");

org.jsoup.nodes.Element pessoaNome = doc.getElementById("pessoaNome");

pessoaNome.html("My new html in here");

OutputSettings oSettings = new OutputSettings();
oSettings.syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
doc.outputSettings(oSettings);
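// Encode explicitly as UTF-8 instead of relying on the platform default charset
InputStream is = new ByteArrayInputStream(doc.outerHtml().getBytes(java.nio.charset.StandardCharsets.UTF_8));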
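To confirm that every tag really is closed before handing the stream to the strict library, one option is to re-parse the output with a plain XML parser and see whether it is accepted. This is only a sketch, assuming that "all tags closed" amounts to XML well-formedness. The class WellFormedCheck and its methods are hypothetical names, not part of Jsoup or of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, introduced only for illustration.
class WellFormedCheck {

    // Returns true if the stream parses as well-formed XML, false if the parser rejects it.
    static boolean isWellFormed(InputStream in) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and stays silent otherwise.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(in);
            return true;
        } catch (SAXException e) {
            // Raised for problems such as an element without a matching end tag.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Could not run the well-formedness check", e);
        }
    }

    // Convenience overload for markup that is already held in a String.
    static boolean isWellFormed(String markup) {
        return isWellFormed(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><head><link href=\"a.css\"/></head></html>"));
        System.out.println(isWellFormed("<html><head><link href=\"a.css\"></head></html>"));
    }
}
```

A check like this consumes the stream it reads, so it is easiest to run on the byte array or String before wrapping it in the InputStream that goes to the library.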

Answer 1

Take a look at JTidy, which cleans HTML. There is also jsoup, which is newer and supposedly does the same thing, only better.

