使用飞碟 PDF 渲染将格式错误的 HTML 转换为 PDF

Convert malformed HTML to PDF using Flying Saucer PDF Rendering

在一个项目 GitHub 中,我正在尝试将任意 HTML 字符串转换为 PDF 版本。我所说的转换是指解析 HTML,并将其呈现为 PDF 文件。

为了实现这一点,我使用 Flying Saucer PDF Rendering 这样的:

Main.java

public class Main {

    public static void main(String [] args) {
        final String ok = "<valid html here>: see github rep for real html markup here";
        final String html = "<invalid html here>: see github rep for real html markup here";
        try {
            // final byte[] bytes = generatePDFFrom(ok); // works!
            final byte[] bytes = generatePDFFrom(html); // does NOT work :(
            try(FileOutputStream fos = new FileOutputStream("sample-file.pdf")) {
                fos.write(bytes);
            }

        } catch (IOException | DocumentException e) {
            e.printStackTrace();
        }
    }

    private static byte[] generatePDFFrom(String html) throws IOException, DocumentException {
        final ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(html);
        renderer.layout();
        try (ByteArrayOutputStream fos = new ByteArrayOutputStream(html.length())) {
            renderer.createPDF(fos);
            return fos.toByteArray();
        }
    }
}

在上面的代码中,如果我使用存储在 ok 变量中的 html 字符串(这是一个 "valid" html),它会正确创建 PDF(如果你 运行 GitHub 项目通过使用 ok 变量,它将在项目文件夹中创建一个文件 sample-file.pdf,其中包含一些渲染的 html).

现在,如果我使用 html 变量中的值(html 带有无效标签,标签可能未正确关闭等),它会抛出以下错误(错误可能因不正确的值):

ERROR:  'The markup in the document following the root element must be well-formed.'
Exception in thread "main" org.xhtmlrenderer.util.XRRuntimeException: Can't load the XML resource (using TrAX transformer). org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(XMLResource.java:222)
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.createXMLResource(XMLResource.java:181)
    at org.xhtmlrenderer.resource.XMLResource.load(XMLResource.java:84)
    at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(ITextRenderer.java:171)
    at org.xhtmlrenderer.pdf.ITextRenderer.setDocumentFromString(ITextRenderer.java:166)
    at Main.generatePDFFrom(Main.java:84)
    at Main.main(Main.java:72)
Caused by: javax.xml.transform.TransformerException: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:740)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:343)
    at org.xhtmlrenderer.resource.XMLResource$XMLResourceBuilder.transform(XMLResource.java:220)
    ... 6 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 9; The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:659)
    at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:728)
    ... 8 more

现在,据我所知,这是因为 html 字符串的 "invalid" 部分。

重要提示:

问题

最初的想法是通过另一个能够更好地处理 html 的库来解析您的输入,然后 toString() 该库的结果转换为 PDF渲染器。

https://jsoup.org/

五分钟的谷歌搜索发现这是一个非常合理的库。甚至还有一个测试实用程序,您可以尝试将格式错误的输入输入:

https://try.jsoup.org/

由于我在使用 Flying Saucer 从 HTML 生成 PDF 时遇到了同样的问题,所以我在解析为 Flying Saucer 之前使用 HtmlCleaner library (see maven link) 清理了 HTML 代码图书馆。

// Clean the html to use in the flying saucer converting tool
// get the element you want to serialize
HtmlCleaner cleaner = new HtmlCleaner();
TagNode rootTagNode = cleaner.clean(html);
// set up properties for the serializer (optional, see online docs)
CleanerProperties cleanerProperties = cleaner.getProperties();
// use the getAsString method on an XmlSerializer class
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String cleanedHtml = xmlSerializer.getAsString(rootTagNode);

// use the https://github.com/flyingsaucerproject/flyingsaucer to convert cleaned HTML to PDF
ITextRenderer renderer = new ITextRenderer();
renderer.setDocumentFromString(cleanedHtml);
// ....