Cannot convert DOCX to HTML with Python

Question

I have already tried mammoth:

import mammoth

result = mammoth.convert_to_html("MyDocument.docx")
print (result.value)

Instead of HTML, I get this strange output:

kbW7yqZoo4h9pYM6yBxX1QFx2pCoPYflXfieIPbtqpT913Vk7OzcZdEk3eO7TbWjvZNTGilsfmRrPwDvB[...]

I also tried docx2html, but I cannot install it. When I run pip install docx2html, I get this error:

SyntaxError: Missing parentheses in call to 'print'

Answer 1

As stated in the documentation:

To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:

import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value # The generated HTML
    messages = result.messages # Any messages, such as warnings during conversion
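A minimal sketch of how the conversion messages might be reported, assuming each message exposes type and message attributes as the Mammoth README describes. The helper names format_messages and convert_docx are made up for illustration, and mammoth is imported inside the function so that the formatting helper can be used on its own.

```python
def format_messages(messages):
    """Turn conversion messages into printable lines such as 'warning: ...'."""
    return ["{}: {}".format(m.type, m.message) for m in messages]


def convert_docx(path):
    """Convert the .docx file at `path`; returns (html, formatted_messages)."""
    import mammoth  # third-party package: pip install mammoth

    # Mammoth expects a file-like object opened in binary mode.
    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
    return result.value, format_messages(result.messages)
```

Calling convert_docx("document.docx") should then give back the HTML fragment together with any warnings as plain strings.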

Answer 2

Mammoth .docx to HTML converter

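Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, into HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.

There is a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.

The following features are currently supported:

Headings.
Lists.
Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
Footnotes and endnotes.
Images.
Bold, italics, underlines, strikethrough, superscript and subscript.
Links.
Line breaks.
Text boxes. The contents of a text box are treated as a separate paragraph that appears after the paragraph containing the text box.
Comments.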
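The custom style mapping mentioned in the list could look roughly like the sketch below. It assumes the style_map argument of mammoth.convert_to_html and the p[style-name='...'] => ... syntax from the Mammoth README; the :fresh suffix asks for a new element for each matching paragraph. The helper names build_style_map and convert_with_styles are hypothetical.

```python
def build_style_map(mappings):
    """Build a Mammoth style map string from {docx style name: HTML target}."""
    lines = [
        "p[style-name='{}'] => {}:fresh".format(style_name, target)
        for style_name, target in mappings.items()
    ]
    return "\n".join(lines)


def convert_with_styles(path, mappings):
    """Convert a .docx file using a custom style map; returns the HTML fragment."""
    import mammoth  # third-party package: pip install mammoth

    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_html(
            docx_file, style_map=build_style_map(mappings)
        )
    return result.value
```

For example, convert_with_styles("document.docx", {"WarningHeading": "h1.warning"}) would map paragraphs styled WarningHeading to h1 elements with the class warning.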

Installation

pip install mammoth

Basic conversion

To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:

import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value # The generated HTML
    messages = result.messages # Any messages, such as warnings during conversion

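You can also extract the raw text of the document by using mammoth.extract_raw_text. This will ignore all formatting in the document. Each paragraph is followed by two newlines.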

with open("document.docx", "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    text = result.value # The raw text
    messages = result.messages # Any messages
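Because each paragraph is followed by two newlines, the raw text can be split back into paragraphs. This is only a sketch: split_paragraphs is a made-up helper, and it drops empty paragraphs, which may or may not be what you want.

```python
def split_paragraphs(raw_text):
    """Split raw text (paragraphs separated by two newlines) into a list."""
    # Splitting on the double newline leaves an empty trailing piece,
    # so blank entries are filtered out.
    return [p for p in raw_text.split("\n\n") if p.strip()]


sample = "Title\n\nFirst paragraph.\n\n"
print(split_paragraphs(sample))  # ['Title', 'First paragraph.']
```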

Answer 3

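You can use the pypandoc module for this. See the code below:

import pypandoc

# The second argument is the target format, so it must be 'html' here.
# The converted document is written to the file named by outputfile.
output = pypandoc.convert_file('file.docx', 'html', outputfile="file_converted.html")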

Answer 4

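The problem you are running into may be that mammoth does not create a complete HTML file, only an HTML fragment. That means it is missing the <html> and <body> tags. Some browsers can still render the content because they are forgiving enough, but I ran into a similar problem when trying to use the raw output. A good workaround is to add this to your code to turn it into a proper HTML file: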

import mammoth

with open("test.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings during conversion

    full_html = (
        '<!DOCTYPE html><html><head><meta charset="utf-8"/></head><body>'
        + html
        + "</body></html>"
    )

    with open("test.html", "w", encoding="utf-8") as f:
        f.write(full_html)

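Here test.html is whatever name you choose for the output file.

This is not my own work; I also found it on this site, but I cannot find the original post.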
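If you need this in more than one place, the wrapping step could be pulled into a small helper. This is a sketch using only the standard library; wrap_fragment and its title parameter are my own additions, not part of mammoth.

```python
import html


def wrap_fragment(fragment, title="Converted document"):
    """Wrap an HTML fragment in a minimal, complete HTML document."""
    # Escape the title so characters like & or < cannot break the markup.
    head = (
        '<head><meta charset="utf-8"/><title>'
        + html.escape(title)
        + "</title></head>"
    )
    return (
        "<!DOCTYPE html>\n<html>"
        + head
        + "<body>\n"
        + fragment
        + "\n</body></html>"
    )
```

You would pass result.value as the fragment and write the returned string to a file opened with encoding="utf-8".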

Tags: html, ms-word, converter, python-3.x, mammoth