Docx4j has issues with some styles when converting an HTML document to docx

After I convert this HTML file to a docx document, some of the styles come out wrong.

<html>
<head>
<style>
div,p{ 
    background-color: #ff0000;
    padding: 100px;
    border: 10px solid #000;
    text-align: justify;
    margin-bottom: 50px;
    text-indent: 50px;
}
</style>
</head>
<body>
    <div>test test test <br/>test test test <br/>test test test</div>
    <p>test test test <br/>test test test <br/>test test test</p>
    <p>test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test </p>
</body>
</html>

I am using the following unit test:

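import java.io.File;

import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart;
import org.junit.Test; // assumption: JUnit 4; adjust for your test framework

@Test
public void testConvertXhtml3() throws Exception {
    String inputfilepath = "/Users/kyv/Documents/test.html";

    // Create an empty docx package
    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

    // Add a numbering part populated with the default numbering definitions
    NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
    wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
    ndp.unmarshalDefaultNumbering();

    XHTMLImporterImpl xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);

    // Convert the XHTML, and add it into the empty docx we made
    wordMLPackage.getMainDocumentPart().getContent().addAll(
            xHTMLImporter.convert(new File(inputfilepath), null));

    // Save the resulting docx
    wordMLPackage.save(new File("/Users/kyv/Documents/test.docx"));
}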

In the console I get a lot of "How to handle: ..." messages. Here is part of the log:

Attempting to load: docx4j.properties
Using paper size: A4
Landscape orientation: false

Set contentType application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml on part /


java.vendor=Oracle Corporation
java.version=1.7.0_55

jar:file:/Users/kvn/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar!/META-INF/MANIFEST.MF
Implementation-Title : JAXB Reference Implementation 
Implementation-Version : 2.2.3
Class-Path : jaxb-api.jar activation.jar jsr173_1.0_api.jar jaxb1-impl.jar
Manifest-Version : 1.0
Specification-Vendor : Oracle Corporation
Created-By : 1.5.0_22-b03 (Sun Microsystems Inc.)
Ant-Version : Apache Ant 1.7.1
Implementation-Vendor : Oracle Corporation
Implementation-Vendor-Id : com.sun
Specification-Title : Java Architecture for XML Binding
Specification-Version : 2.2.2
Extension-Name : com.sun.xml.bind
Build-Id : hudson-jaxb-ri-2.2.3-3
Found JAXB reference implementation in jar:file:/Users/kushniry/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar!/META-INF/MANIFEST.MF
Implementation-Version : 2.2.3-hudson-jaxb-ri-2.2.3-3-
Attempting to load: org/docx4j/wml/jaxb.properties
Not using MOXy, since no resource: org/docx4j/wml/jaxb.properties
No MOXy JAXB config found; assume not intended..
org/docx4j/wml/jaxb.properties not found via classloader.
name: com.sun.xml.internal.bind.namespacePrefixMapper value: org.docx4j.jaxb.NamespacePrefixMapperSunInternal@2a3d4350 .. trying RI.
Using NamespacePrefixMapper, which is suitable for the JAXB RI
Using JAXB Reference Implementation
Not using MOXy; using com.sun.xml.bind.v2.runtime.JAXBContextImpl
.. other contexts loaded ..

Set contentType application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml on part /word/document.xml


Using paper size: A4
Landscape orientation: false

Set contentType application/vnd.openxmlformats-package.relationships+xml on part /_rels/.rels


setPackage called for org.docx4j.openpackaging.parts.relationships.RelationshipsPart
setPackage called for org.docx4j.openpackaging.parts.relationships.RelationshipsPart
Registered rels
adding part with proposed name: /word/document.xml

Relativising target /word/document.xml against source /
Result word/document.xml
rel exists: false


Loading part /word/document.xml

put part /word/document.xml

setPackage called for org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart
Set shortcut for mainDoc
shortcut was set

Set contentType application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml on part /word/styles.xml


docx4j.openpackaging.parts.WordprocessingML.StyleDefinitionsPart.DefaultStyles resolved to org/docx4j/openpackaging/parts/WordprocessingML/styles.xml
Attempting to load: org/docx4j/openpackaging/parts/WordprocessingML/styles.xml
For org.docx4j.openpackaging.parts.WordprocessingML.StyleDefinitionsPart, unmarshall via binder
Oracle Corporation
1.7.0_55
Using com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
Using com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
info: com.sun.xml.bind.v2.runtime.BinderImpl


Set contentType application/vnd.openxmlformats-package.relationships+xml on part /word/_rels/document.xml.rels



setPackage called for org.docx4j.openpackaging.parts.relationships.RelationshipsPart

setPackage called for org.docx4j.openpackaging.parts.relationships.RelationshipsPart

Registered rels

adding part with proposed name: /word/styles.xml

Relativising target /word/styles.xml against source /word/document.xml
Result styles.xml
rel exists: false


Loading part /word/styles.xml

put part /word/styles.xml

setPackage called for org.docx4j.openpackaging.parts.WordprocessingML.StyleDefinitionsPart
shortcut was set
xpath implementation: org.apache.xpath.jaxp.XPathFactoryImpl

Set contentType application/vnd.openxmlformats-package.core-properties+xml on part /docProps/core.xml


adding part with proposed name: /docProps/core.xml

Relativising target /docProps/core.xml against source /
Result docProps/core.xml
rel exists: false


Loading part /docProps/core.xml

put part /docProps/core.xml

setPackage called for org.docx4j.openpackaging.parts.DocPropsCorePart
Set shortcut for docPropsCorePart
shortcut was set

Set contentType application/vnd.openxmlformats-officedocument.extended-properties+xml on part /docProps/app.xml


adding part with proposed name: /docProps/app.xml

Relativising target /docProps/app.xml against source /
Result docProps/app.xml
rel exists: false


Loading part /docProps/app.xml

put part /docProps/app.xml

setPackage called for org.docx4j.openpackaging.parts.DocPropsExtendedPart
Set shortcut for docPropsExtendedPart
shortcut was set

Set contentType application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml on part /word/numbering.xml




adding part with proposed name: /word/numbering.xml

Relativising target /word/numbering.xml against source /word/document.xml
Result numbering.xml
rel exists: false


Loading part /word/numbering.xml

put part /word/numbering.xml

setPackage called for org.docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart
shortcut was set
docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart.DefaultNumbering resolved to org/docx4j/openpackaging/parts/WordprocessingML/numbering.xml
Attempting to load: org/docx4j/openpackaging/parts/WordprocessingML/numbering.xml
For org.docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart, unmarshall via binder
info: com.sun.xml.bind.v2.runtime.BinderImpl
tableFormatting: CLASS_PLUS_OTHER
paragraphFormatting: CLASS_PLUS_OTHER
runFormatting: CLASS_PLUS_OTHER
Attempting to load: docx4j-ImportXHTML.properties
Preparing StyleTree
Style with name Normal, id 'Normal' is default paragraph style
Set virtual style, id 'DocDefaults', name 'DocDefaults'
setProperty: com.sun.xml.bind.namespacePrefixMapper
<w:style w:type="paragraph" w:styleId="DocDefaults" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:ns21="urn:schemas-microsoft-com:office:powerpoint" xmlns:ns23="http://schemas.microsoft.com/office/2006/coverPageProps" xmlns:dsp="http://schemas.microsoft.com/office/drawing/2008/diagram" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:odx="http://opendope.org/xpaths" xmlns:odgm="http://opendope.org/SmartArt/DataHierarchy" xmlns:dgm="http://schemas.openxmlformats.org/drawingml/2006/diagram" xmlns:ns17="urn:schemas-microsoft-com:office:excel" xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart" xmlns:odi="http://opendope.org/components" xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" xmlns:ns9="http://schemas.openxmlformats.org/schemaLibrary/2006/main" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:ns32="http://schemas.openxmlformats.org/drawingml/2006/lockedCanvas" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture" xmlns:ns30="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" xmlns:ns12="http://schemas.openxmlformats.org/drawingml/2006/chartDrawing" xmlns:ns31="http://schemas.openxmlformats.org/drawingml/2006/compatibility" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:odq="http://opendope.org/questions" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:odc="http://opendope.org/conditions" xmlns:oda="http://opendope.org/answers">
    <w:name w:val="DocDefaults"/>
    <w:pPr>
        <w:spacing w:after="200" w:line="276" w:lineRule="auto"/>
    </w:pPr>
    <w:rPr>
        <w:rFonts w:asciiTheme="minorHAnsi" w:hAnsiTheme="minorHAnsi" w:eastAsiaTheme="minorHAnsi" w:cstheme="minorBidi"/>
        <w:sz w:val="22"/>
        <w:szCs w:val="22"/>
        <w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/>
    </w:rPr>
</w:style>
Style with name Default Paragraph Font, id 'DefaultParagraphFont' is default character style
getting children of java.util.ArrayList


No numPr.. 
200 twips -> 3.5250988mm (0.14inches)

 /* TABLE STYLES */ 

 /* PARAGRAPH STYLES */ 
.DocDefaults {display:block;margin-bottom: 4mm;line-height: 115%;font-size: 11.0pt;}

 /* CHARACTER STYLES */ 
org.docx4j.org.xhtmlrenderer.load INFO:: SAX XMLReader in use (parser): org.apache.xerces.parsers.SAXParser
org.docx4j.org.xhtmlrenderer.load INFO:: SAX XMLReader in use (parser): org.apache.xerces.parsers.SAXParser
org.docx4j.org.xhtmlrenderer.load INFO:: SAX XMLReader in use (parser): org.apache.xerces.parsers.SAXParser
org.docx4j.org.xhtmlrenderer.load INFO:: SAX XMLReader in use (parser): org.apache.xerces.parsers.SAXParser
org.docx4j.org.xhtmlrenderer.load INFO:: SAX XMLReader in use (parser): org.apache.xerces.parsers.SAXParser
org.docx4j.org.xhtmlrenderer.load INFO:: SAX XMLReader in use (parser): org.apache.xerces.parsers.SAXParser
org.docx4j.org.xhtmlrenderer.load INFO:: Loaded document in ~91ms
org.docx4j.org.xhtmlrenderer.load INFO:: TIME: parse stylesheets  170ms
org.docx4j.org.xhtmlrenderer.match INFO:: media = print
org.docx4j.org.xhtmlrenderer.match INFO:: Matcher created with 136 selectors
org.docx4j.org.xhtmlrenderer.render.BlockBox
BB<html color: #000000; background-color: transparent; background-image: none; background-repeat: repeat; background-attachment: scroll; background-position: [0%, 0%]; background-size: [auto, auto]; border-collapse: separate; -fs-border-spacing-horizontal: 0; -fs-border-spacing-vertical: 0; -fs-font-metric-src: none; -fs-keep-with-inline: auto; -fs-page-width: auto; -fs-page-height: auto; -fs-page-sequence: auto; -fs-pdf-font-embed: auto; -fs-pdf-font-encoding: Cp1252; -fs-page-orientation: auto; -fs-table-paginate: auto; -fs-text-decoration-extent: line; bottom: auto; caption-side: top; clear: none; ; content: normal; counter-increment: none; counter-reset: none; cursor: auto; ; display: block; empty-cells: show; float: none; font-style: normal; font-variant: normal; font-weight: normal; font-size: medium; line-height: normal; font-family: serif; -fs-table-cell-colspan: 1; -fs-table-cell-rowspan: 1; height: auto; left: auto; letter-spacing: normal; list-style-type: disc; list-style-position: outside; list-style-image: none; max-height: none; max-width: none; min-height: 0; min-width: 0; orphans: 2; ; ; ; overflow: visible; page: auto; page-break-after: auto; page-break-before: auto; page-break-inside: auto; position: static; ; right: auto; src: none; table-layout: auto; text-align: left; text-decoration: none; text-indent: 0; text-transform: none; top: auto; ; vertical-align: baseline; visibility: visible; white-space: normal; word-wrap: normal; widows: 2; width: auto; word-spacing: normal; z-index: auto; border-top-color: #000000; border-right-color: #000000; border-bottom-color: #000000; border-left-color: #000000; border-top-style: none; border-right-style: none; border-bottom-style: none; border-left-style: none; border-top-width: 2px; border-right-width: 2px; border-bottom-width: 2px; border-left-width: 2px; margin-top: 0; margin-right: 0; margin-bottom: 0; margin-left: 0; padding-top: 0; padding-right: 0; padding-bottom: 0; padding-left: 0; 
block
default handling for html
How to handle: border-bottom-width?
How to handle: text-indent?
How to handle: cursor?
How to handle: visibility?
How to handle: border-right-style?
How to handle: font-weight?
How to handle: float?
How to handle: border-bottom-style?
How to handle: height?
How to handle: background-size?
How to handle: page?
How to handle: border-right-color?
How to handle: border-right-width?
How to handle: white-space?
How to handle: right?
How to handle: background-image?
How to handle: background-position?
How to handle: padding-right?
How to handle: widows?
How to handle: max-height?
How to handle: width?
How to handle: display?
How to handle: min-height?
How to handle: padding-bottom?
How to handle: content?
How to handle: border-left-color?
How to handle: border-top-color?
How to handle: background-attachment?
How to handle: border-left-style?
How to handle: overflow?
valueType PRIMITIVE for margin-left
PrimitiveType: 1
margin-left: 0.0
How to handle: bottom?
How to handle: page-break-inside?
How to handle: margin-top?
How to handle: empty-cells?
How to handle: caption-side?
How to handle: background-repeat?
How to handle: list-style-position?
How to handle: position?
How to handle: border-top-style?
How to handle: counter-reset?
valueType PRIMITIVE for text-align
PrimitiveType: 21
How to handle: counter-increment?
valueType PRIMITIVE for page-break-after
PrimitiveType: 21
How to handle: clear?
How to handle: margin-right?
valueType PRIMITIVE for line-height
PrimitiveType: 21
How to handle: border-collapse?
How to handle: font-size?
How to handle: left?
How to handle: word-wrap?
How to handle: src?
How to handle: border-left-width?
How to handle: word-spacing?
How to handle: top?
How to handle: padding-left?
How to handle: padding-top?
How to handle: list-style-type?
How to handle: letter-spacing?
How to handle: font-variant?
...............



..............

How to handle: font-family?
valueType PRIMITIVE for page-break-before
PrimitiveType: 21
No mapping for: 'serif'
.. processed child org.docx4j.org.xhtmlrenderer.render.InlineBox
Done processing children of org.docx4j.org.xhtmlrenderer.render.BlockBox
.. processed child org.docx4j.org.xhtmlrenderer.render.BlockBox
Done processing children of org.docx4j.org.xhtmlrenderer.render.BlockBox
.. processed child org.docx4j.org.xhtmlrenderer.render.BlockBox
Done processing children of org.docx4j.org.xhtmlrenderer.render.BlockBox
sourcePartStore undefined
setProperty: com.sun.xml.bind.namespacePrefixMapper
marshalling org.docx4j.openpackaging.contenttype.ContentTypeManager ...
marshalling /_rels/.rels
name: com.sun.xml.internal.bind.namespacePrefixMapper value: org.docx4j.jaxb.NamespacePrefixMapperRelationshipsPartSunInternal@7bf8dc3c .. trying RI.
Using NamespacePrefixMapperRelationshipsPart, which is suitable for the JAXB RI
setProperty: com.sun.xml.bind.namespacePrefixMapper
marshalling org.docx4j.openpackaging.parts.relationships.RelationshipsPart
For Relationship Id=rId1 Source is /, Target is word/document.xml
Getting part /word/document.xml

org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart


.. saving 
marshalling /word/document.xml
setProperty: com.sun.xml.bind.namespacePrefixMapper
marshalling org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart
marshalling /word/_rels/document.xml.rels
setProperty: com.sun.xml.bind.namespacePrefixMapper
marshalling org.docx4j.openpackaging.parts.relationships.RelationshipsPart
For Relationship Id=rId1 Source is /word/document.xml, Target is styles.xml
Getting part /word/styles.xml

org.docx4j.openpackaging.parts.WordprocessingML.StyleDefinitionsPart


.. saving 
marshalling /word/styles.xml
setProperty: com.sun.xml.bind.namespacePrefixMapper
marshalling org.docx4j.openpackaging.parts.WordprocessingML.StyleDefinitionsPart
For Relationship Id=rId2 Source is /word/document.xml, Target is numbering.xml
Getting part /word/numbering.xml

org.docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart


.. saving 
marshalling /word/numbering.xml
setProperty: com.sun.xml.bind.namespacePrefixMapper
marshalling org.docx4j.openpackaging.parts.WordprocessingML.NumberingDefinitionsPart
For Relationship Id=rId2 Source is /, Target is docProps/core.xml
Getting part /docProps/core.xml

org.docx4j.openpackaging.parts.DocPropsCorePart


.. saving 
marshalling /docProps/core.xml
setProperty: com.sun.xml.bind.namespacePrefixMapper
marshalling org.docx4j.openpackaging.parts.DocPropsCorePart
For Relationship Id=rId3 Source is /, Target is docProps/app.xml
Getting part /docProps/app.xml

org.docx4j.openpackaging.parts.DocPropsExtendedPart


.. saving 
marshalling /docProps/app.xml
setProperty: com.sun.xml.bind.namespacePrefixMapper
marshalling org.docx4j.openpackaging.parts.DocPropsExtendedPart
...Done!

Is there any way I can fix this so that the document is converted with the correct styles? My setup:

docx4j.AppVersion=3.3

<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j</artifactId>
    <version>3.2.1</version>
</dependency>
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-ImportXHTML</artifactId>
    <version>3.2.1</version>
</dependency>

That is DEBUG-level logging from PropertyFactory, intended to tell developers which CSS properties are currently ignored/unsupported.
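If you want a quick overview of which properties were reported, one option is to collect the distinct names from the "How to handle: ...?" lines. The following is only a rough sketch that assumes the log has been captured as a list of strings; the class name UnsupportedCssProperties and its method are made up for illustration and are not part of docx4j.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of docx4j: summarises the DEBUG output.
public class UnsupportedCssProperties {

    // Matches lines of the form "How to handle: text-indent?"
    private static final Pattern HOW_TO_HANDLE =
            Pattern.compile("How to handle: ([A-Za-z-]+)\\?");

    /** Returns the distinct property names reported in the log, sorted alphabetically. */
    public static Set<String> unsupported(List<String> logLines) {
        Set<String> names = new TreeSet<>();
        for (String line : logLines) {
            Matcher m = HOW_TO_HANDLE.matcher(line);
            if (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // A few lines copied from the log in the question
        List<String> sample = Arrays.asList(
                "How to handle: border-bottom-width?",
                "How to handle: text-indent?",
                "valueType PRIMITIVE for margin-left",
                "How to handle: text-indent?");
        System.out.println(unsupported(sample)); // prints [border-bottom-width, text-indent]
    }
}
```

Comparing that list with the properties in your stylesheet (padding, border, text-indent and so on) shows which of your rules the importer reported rather than mapped.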

Also note that you can use styles defined in the target docx if they match the @class values. This is configured separately at the paragraph, run, and table level.
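The log in the question reports CLASS_PLUS_OTHER for tableFormatting, paragraphFormatting, and runFormatting, which suggests class values are being taken into account alongside other formatting. Before relying on that, it can help to check which class values in the XHTML actually have a counterpart in the target docx. Below is a minimal pre-flight sketch using only the JDK; it assumes the match is made against style ids, and the class name ClassStyleCheck, the sample markup, and the style id set are all hypothetical (in practice the ids would come from your target docx).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of docx4j.
public class ClassStyleCheck {

    /** Collects every whitespace-separated token found in a class attribute. */
    public static Set<String> classValues(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        Set<String> classes = new TreeSet<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            // getAttribute returns an empty string when the attribute is absent
            String attr = ((Element) all.item(i)).getAttribute("class").trim();
            if (!attr.isEmpty()) {
                classes.addAll(Arrays.asList(attr.split("\\s+")));
            }
        }
        return classes;
    }

    /** Returns the class values that have no counterpart in the given style ids. */
    public static Set<String> unmatched(Set<String> classes, Set<String> styleIds) {
        Set<String> result = new TreeSet<>(classes);
        result.removeAll(styleIds);
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p class=\"Heading1\">Title</p>"
                + "<p class=\"boxed\">test test test</p></body></html>";
        // Assumed style ids; replace with the ids defined in your docx.
        Set<String> styleIds = new HashSet<>(Arrays.asList("Normal", "Heading1"));
        System.out.println(unmatched(classValues(xhtml), styleIds)); // prints [boxed]
    }
}
```

Any class left in the result would need either a matching style in the docx or plain CSS that the importer supports.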