Pandoc

Question

I have some MS Word files (docx) that I convert to markdown files. Later, these markdown files are converted to PDF and HTML files. All of the conversions are done with pandoc.

When a Word file is converted to markdown, my Python pandoc filter needs to get each image's width and height (in inches) from the AST. This works fine; I can get this information from the AST.

{
    "t": "Image",
    "c": [
        [
            "",
            [],
            [
                ["width", "5.113165354330708in"],
                ["height", "3.063299212598425in"]
            ]
        ],
        [],
        ["media/image1.png", ""]
    ]
}
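For illustration, here is a minimal sketch of reading those two attributes out of such a node in a Python filter. The helper name image_size_inches is hypothetical, and the sketch assumes the node has the shape shown above and that sizes are given with an "in" suffix; other units are simply returned as None.

```python
def image_size_inches(node):
    """Return (width_in, height_in) for a pandoc JSON Image node.

    A value is None when the attribute is missing or not given in inches.
    """
    # "c" holds [attr, alt_inlines, target]; attr is [id, classes, key-value pairs].
    (_ident, _classes, keyvals), _alt, (_src, _title) = node["c"]
    attrs = dict((key, value) for key, value in keyvals)

    def inches(value):
        if value is None or not value.endswith("in"):
            return None
        return float(value[:-2])

    return inches(attrs.get("width")), inches(attrs.get("height"))


# The node from the question, as it appears in the JSON AST.
node = {
    "t": "Image",
    "c": [
        ["", [], [["width", "5.113165354330708in"],
                  ["height", "3.063299212598425in"]]],
        [],
        ["media/image1.png", ""],
    ],
}

print(image_size_inches(node))
```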

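But the filter also needs to open the actual image from the file system with the Pillow library, to get the image size (in pixels) and the DPI information for some calculations.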
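A rough sketch of that Pillow step might look like the following. The function names are hypothetical. It assumes the image file already exists on disk and that the format stores a DPI value; Pillow only puts "dpi" into the info dictionary when the file carries it, so the sketch returns None otherwise.

```python
from PIL import Image


def pixel_size_and_dpi(path):
    """Return ((width_px, height_px), dpi) for the image at `path`.

    `dpi` is an (x, y) tuple of floats, or None if the file stores no DPI.
    """
    with Image.open(path) as img:
        size = img.size                 # (width, height) in pixels
        dpi = img.info.get("dpi")       # present only if the file records it
    if dpi is not None:
        dpi = (float(dpi[0]), float(dpi[1]))
    return size, dpi


def displayed_dpi(pixels, inches):
    """Pixels per inch that result when `pixels` are shown across `inches`."""
    return pixels / inches
```

For example, displayed_dpi(width_px, width_in) compares the pixel width from Pillow with the width in inches taken from the AST.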

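The problem is that when I try to open the image with the Python package Pillow, inside the pandoc filter used for the docx to markdown conversion (where I build the markdown image link), it says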

FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/mertcan.segmen/Desktop/doc/media/image1.png'

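This probably means that pandoc does not extract the images from the Word file before it executes the filter. Is this normal? If not, are there any suggestions on how to achieve what I want?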

Answer 1

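I found a workaround of sorts: I run pandoc --extract-media ./ MyDocxFile.docx just before converting my docx to markdown (note that --extract-media takes the target directory as its argument, followed by the input file). This extracts the images from the docx file into the media folder, and then I run my pandoc command for the conversion. Since the images were extracted beforehand, my filter can access them.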
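If the two steps are driven from a script, the workaround could be sketched as below. This is only an outline under assumptions: the function names, the filter path, and the output path are hypothetical, pandoc is assumed to be on the PATH, and passing --extract-media to the second command as well is my own choice so that the image links in the markdown point at the extracted files.

```python
import subprocess


def build_commands(docx_path, markdown_path, filter_path, media_root="."):
    """Return the two pandoc command lines: extract media, then convert."""
    extract_flag = f"--extract-media={media_root}"
    # Step 1: run pandoc once only for its side effect of writing the images.
    extract_cmd = ["pandoc", extract_flag, docx_path]
    # Step 2: the real conversion; the filter can now open the image files.
    convert_cmd = [
        "pandoc", extract_flag,
        "--filter", filter_path,
        "-t", "markdown",
        "-o", markdown_path,
        docx_path,
    ]
    return extract_cmd, convert_cmd


def convert(docx_path, markdown_path, filter_path, media_root="."):
    """Run both steps, raising CalledProcessError if pandoc fails."""
    extract_cmd, convert_cmd = build_commands(
        docx_path, markdown_path, filter_path, media_root
    )
    # The first run prints the converted document to stdout; discard it.
    subprocess.run(extract_cmd, check=True, stdout=subprocess.DEVNULL)
    subprocess.run(convert_cmd, check=True)
```

A call such as convert("MyDocxFile.docx", "MyDocxFile.md", "my_filter.py") would then leave the images under ./media before the filter runs.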

Pandoc - Images in Word file are not extracted into media folder at the time of the filter execution