Issues converting larger HTML files to docx in pypandoc (pandoc)

Question

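My question is related to "How to increase heap memory in pandoc execution?", but adds a Python-specific component.

Background: I am trying to automate the generation of a scientific report. I have written the data to an HTML file, which I want to convert to a .docx Word document using Pandoc.exe (a file conversion program). I have the process working for smaller HTML files with images, tables, etc. That file is 307 KB.

The problem arises when I try to convert a larger file (~4.5 MB) with several embedded graphs. I have been using pypandoc for the conversion, like so: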

import os
import pypandoc

PANDOC_PATH = r"C:\Program Files\RStudio\bin\pandoc"

# savepath and name come from elsewhere in the script (not shown)
infile = os.path.join(savepath, 'Results ' + name + '.html')
outfile = os.path.join(savepath, 'Results ' + name + '.docx')

output = pypandoc.convert(source=infile, format='html', to='docx',
                          outputfile=outfile, extra_args=["+RTS", "-K64m", "-RTS"])

But I run into a variety of errors. Typically:

RuntimeError: Pandoc died with exitcode "2" during conversion: 
b"Stack space overflow: current size 33692 bytes.\nUse `+RTS -Ksize -RTS' to increase it.\n"

Or, if I raise the -Ksize value to 256m, this:

RuntimeError: Pandoc died with exitcode "1" during conversion: b'pandoc: out of memory\r\n'

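Can anyone explain what is going on here, and suggest some ways I could get around this difficulty? One solution I have considered is making my images much smaller. At the moment I just scale down the originals (80 - 500 KB) like this, where each image's width and height depend on its original dimensions: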

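import base64

# Read the image and base64-encode it; the with-block closes the file
with open(formats[graph][0], 'rb') as image_file:
    data_uri = base64.b64encode(image_file.read()).decode('utf-8')

img_tag = ('<img src="data:image/jpg;base64,{0}" height=' + formats[graph][2][0] +
           ' width=' + formats[graph][2][1] + '>').format(data_uri)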
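The embedding step above can also be wrapped in a small helper. This is only a sketch: `image_to_img_tag` is a hypothetical name, and guessing the MIME type from the file extension (so that a .jpg file is labelled image/jpeg, the registered type) is an assumption rather than something the original code does.

```python
import base64
import mimetypes
from html import escape


def image_to_img_tag(path, height, width):
    """Return an <img> tag with the file at `path` embedded as a base64 data URI."""
    # Guess the MIME type from the extension; .jpg maps to image/jpeg.
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        mime_type = "application/octet-stream"
    # The with-block closes the file as soon as it has been read.
    with open(path, "rb") as image_file:
        payload = base64.b64encode(image_file.read()).decode("ascii")
    return '<img src="data:{0};base64,{1}" height="{2}" width="{3}">'.format(
        mime_type, payload, escape(str(height)), escape(str(width))
    )
```

Note that the height and width attributes only change how the image is displayed. The full base64 payload is still embedded, and base64 makes it roughly a third larger than the file on disk, so the HTML only shrinks if the image files themselves are made smaller.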

Thanks for your help.

Answer 1

Many thanks to user2407038 for helping with this!

Two fixes finally let me convert larger HTML files to docx with pypandoc:

The first, as suggested, was

increasing the maximum size of the heap, e.g. add -M2GB to extra_args

i.e.:

output = pypandoc.convert(source=infile, format='html', to='docx', outputfile=outfile, extra_args=["-M2GB", "+RTS", "-K64m", "-RTS"])
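One thing worth checking in the call above: GHC runtime options only take effect between +RTS and -RTS. Outside that pair, pandoc reads -M as its own --metadata option, so "-M2GB" in that position may not change the heap limit at all. Below is a minimal sketch of a helper that keeps both limits inside the RTS block. `rts_args` is a hypothetical name, the sizes are illustrative assumptions, and it presumes a pandoc binary that accepts RTS options.

```python
def rts_args(stack_size="64m", max_heap="2G"):
    """Build a pandoc extra_args list with GHC RTS limits inside +RTS ... -RTS."""
    # -K sets the maximum stack size, -M the maximum heap size.
    return ["+RTS", "-K" + stack_size, "-M" + max_heap, "-RTS"]


extra_args = rts_args()
print(extra_args)  # ['+RTS', '-K64m', '-M2G', '-RTS']
```

The resulting list can be passed as extra_args=rts_args() in the pypandoc.convert call.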

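I still had a second problem after increasing the heap size, so I am not sure whether that fix worked on its own. Python returned an error message like this: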

RuntimeError: Pandoc died with exitcode "1" during conversion: b"pandoc: Cannot decode byte '\x91': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream\n"

This was solved by changing how the HTML file was opened in the first place. Setting the encoding keyword argument to 'utf8' allowed the conversion to work:

report = open(savepath + os.sep + 'Results ' + name + '.html', 'w', encoding='utf8')
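The byte '\x91' in the error is consistent with a file written in a Windows code page such as cp1252, where 0x91 encodes the left single quotation mark. That is an inference, since the post does not say which default encoding was in use. A small sketch of the difference, using an illustrative string:

```python
text = "\u2018quoted\u2019"  # curly quotes, as often found in generated reports

cp1252_bytes = text.encode("cp1252")  # what a cp1252 default would write
utf8_bytes = text.encode("utf-8")     # what encoding='utf8' writes

print(cp1252_bytes[:1])  # b'\x91'

# pandoc expects UTF-8 input, so check whether each byte string decodes as UTF-8.
try:
    cp1252_bytes.decode("utf-8")
    cp1252_is_valid_utf8 = True
except UnicodeDecodeError:
    cp1252_is_valid_utf8 = False

print(cp1252_is_valid_utf8)                # False
print(utf8_bytes.decode("utf-8") == text)  # True
```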
