Generating docx file from HTML file using OpenXML

I am using this method to generate a docx file:

public static void CreateDocument(string documentFileName, string text)
{
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Create(documentFileName, WordprocessingDocumentType.Document))
    {
        MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();

        string docXml =
                    @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?>
                 <w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
                 <w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body>
                 </w:document>";

        docXml = docXml.Replace("#REPLACE#", text);

        using (Stream stream = mainPart.GetStream())
        {
            byte[] buf = (new UTF8Encoding()).GetBytes(docXml);
            stream.Write(buf, 0, buf.Length);
        }
    }
}

It works like a charm:

CreateDocument("test.docx", "Hello");

But what if I want to put HTML content there instead of Hello? For example:

CreateDocument("test.docx", @"<html><head></head>
                              <body>
                                    <h1>Hello</h1>
                              </body>
                        </html>");

Or like this:

CreateDocument("test.docx", @"Hello<BR>
                                    This is a simple text<BR>
                                    Third paragraph<BR>
                                    Sign
                        ");

Both cases create an invalid structure for document.xml. Any ideas? How can I generate a docx file from HTML content?

You cannot simply insert HTML content into "document.xml". That part accepts only WordprocessingML content, so you have to convert the HTML into WordprocessingML, see this.
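For the very simple second case in the question (plain text separated by <BR>), that conversion can be done by hand. Below is a minimal sketch, assuming the input contains nothing but text and <BR> tags. The names `BrToWordprocessingMl` and `BuildDocumentXml` are hypothetical, made up for illustration, and the sketch uses only LINQ to XML rather than an HTML parser, so anything beyond <BR> would be written out as literal text.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class BrToWordprocessingMl
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: handles only plain text separated by <BR> tags.
    public static XDocument BuildDocumentXml(string htmlFragment)
    {
        // Split on <br>, <BR> or <br/> and trim the whitespace around each piece.
        string[] lines = Regex.Split(htmlFragment, @"<br\s*/?>", RegexOptions.IgnoreCase)
                              .Select(s => s.Trim())
                              .ToArray();

        // One run holding the text pieces, with a w:br element between them.
        var run = new XElement(W + "r");
        for (int i = 0; i < lines.Length; i++)
        {
            if (i > 0)
                run.Add(new XElement(W + "br"));
            // LINQ to XML escapes characters such as < and & in the text.
            run.Add(new XElement(W + "t", lines[i]));
        }

        return new XDocument(
            new XDeclaration("1.0", "UTF-8", "yes"),
            new XElement(W + "document",
                new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
                new XElement(W + "body",
                    new XElement(W + "p", run))));
    }
}
```

In the question's CreateDocument, the result could then be written with BuildDocumentXml(text).Save(stream) on the stream returned by mainPart.GetStream(), instead of replacing #REPLACE# in a raw string. Building the XML through XElement also means stray markup in the text can no longer break the structure of document.xml.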

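Another thing you can use is the altChunk element. With it you can place an HTML file inside the DOCX file and then reference that HTML content at a specific location in your document, see this.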

Finally, as an alternative, with the GemBox.Document library you can achieve exactly what you want, see the following:

public static void CreateDocument(string documentFileName, string text)
{
    DocumentModel document = new DocumentModel();
    document.Content.LoadText(text, LoadOptions.HtmlDefault);
    document.Save(documentFileName);
}

Or you can convert the HTML content directly into a DOCX file:

public static void Convert(string documentFileName, string htmlText)
{
    HtmlLoadOptions options = LoadOptions.HtmlDefault;
    using (var htmlStream = new MemoryStream(options.Encoding.GetBytes(htmlText)))
        DocumentModel.Load(htmlStream, options)
                     .Save(documentFileName);
}

I realize I'm 7 years late to the game. Nonetheless, for people in the future searching for how to convert from HTML to a Word doc, this blog posting on a Microsoft MSDN site gives most of the ingredients necessary to do this using OpenXML. I found the post itself to be confusing, but the source code that he included clarified it all for me.

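The only piece that was missing was how to build a Docx file from scratch, instead of how to merge into an existing one as his example shows. I found that tidbit from here.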

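Unfortunately, the project I used this in is written in vb.net. So I'll share the vb.net code first, followed by an automated C# conversion of it, which may or may not be accurate.

vb.net code: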

Imports DocumentFormat.OpenXml
Imports DocumentFormat.OpenXml.Packaging
Imports DocumentFormat.OpenXml.Wordprocessing
Imports System.IO

Dim ms As IO.MemoryStream
Dim mainPart As MainDocumentPart
Dim b As Body
Dim d As Document
Dim chunk As AlternativeFormatImportPart
Dim altChunk As AltChunk

Const altChunkID As String = "AltChunkId1"

ms = New MemoryStream()

Using myDoc = WordprocessingDocument.Create(ms,WordprocessingDocumentType.Document)
    mainPart = myDoc.MainDocumentPart

    If mainPart Is Nothing Then
        mainPart = myDoc.AddMainDocumentPart()

        b = New Body()
        d = New Document(b)
        d.Save(mainPart)
    End If

    chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Xhtml, altChunkID)

    Using chunkStream As Stream = chunk.GetStream(FileMode.Create, FileAccess.Write)
        Using stringStream As StreamWriter = New StreamWriter(chunkStream)
            stringStream.Write("YOUR HTML HERE")
        End Using
    End Using

    altChunk = New AltChunk()
    altChunk.Id = altChunkID
    mainPart.Document.Body.InsertAt(Of AltChunk)(altChunk, 0)
    mainPart.Document.Save()
End Using

C# code:

using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using System.IO;

MemoryStream ms;
MainDocumentPart mainPart;
Body b;
Document d;
AlternativeFormatImportPart chunk;
AltChunk altChunk;

string altChunkID = "AltChunkId1";

ms = new MemoryStream();

using (WordprocessingDocument myDoc = WordprocessingDocument.Create(ms, WordprocessingDocumentType.Document))
{
    mainPart = myDoc.MainDocumentPart;

    // A freshly created package has no main part yet, so add an empty document.
    if (mainPart == null)
    {
        mainPart = myDoc.AddMainDocumentPart();
        b = new Body();
        d = new Document(b);
        d.Save(mainPart);
    }

    // Store the HTML inside the package as an alternative-format part.
    chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Xhtml, altChunkID);

    using (Stream chunkStream = chunk.GetStream(FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter stringStream = new StreamWriter(chunkStream))
        {
            stringStream.Write("YOUR HTML HERE");
        }
    }

    // Reference the HTML part from the start of the document body.
    altChunk = new AltChunk();
    altChunk.Id = altChunkID;
    mainPart.Document.Body.InsertAt(altChunk, 0);
    mainPart.Document.Save();
}

Note that I use the ms memory stream in another routine, which is where it gets disposed after use.

I hope this helps someone else!

I was able to convert HTML content to a docx file using OpenXML in .NET Core with this code:

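string html = "<strong>Hello</strong> World";
using (MemoryStream generatedDocument = new MemoryStream())
{
    using (WordprocessingDocument package =
               WordprocessingDocument.Create(generatedDocument,
                   WordprocessingDocumentType.Document))
    {
        MainDocumentPart mainPart = package.MainDocumentPart;
        if (mainPart == null)
        {
            mainPart = package.AddMainDocumentPart();
            new Document(new Body()).Save(mainPart);
        }

        // HtmlConverter is not part of the Open XML SDK itself;
        // it is assumed here to be the one from the HtmlToOpenXml package.
        HtmlConverter converter = new HtmlConverter(mainPart);
        converter.ParseHtml(html);
        mainPart.Document.Save();
    }

    // The package is closed at this point, so generatedDocument holds the
    // finished docx. The two snippets further down belong here.
}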

To save it to disk:

System.IO.File.WriteAllBytes("filename.docx", generatedDocument.ToArray());

To return the file as a download in .NET Core MVC, use:

return File(generatedDocument.ToArray(), 
          "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
          "filename.docx");