在不丢失任何格式的情况下将 PDF 转换为 HTML

Convert PDF to HTML without losing any format

我正在开发一个 Python Flask 网络应用程序，我正在尝试将一些用户上传的 pdf 转换为格式良好的 HTML，就像您在制作时生成的 HTML在 iframe.

中显示 pdf

到目前为止我尝试了几种方法：

pdfminer.six库，制作凌乱HTML，
在使用 pdf.js 渲染 PDF 时试图获取生成的 HTML，它显然隐藏在 Shadow DOM 中，无法访问其内部 HTML
我终于遇到了 pdf2htmlEX (https://github.com/pdf2htmlEX/pdf2htmlEX)，它产生了我想要的结果。

在本地，此解决方案运行良好，但在生产状态 (Heroku) 中我无法正确安装它。该项目已被弃用，文档有限且糟糕。该问题与损坏的依赖关系有关。

那么，如何使用 Python 或任何其他工具有效地将 PDF 转换为 HTML 而不会丢失任何格式？

非常感谢。

如果有人愿意帮助我让 pdf2htmlEX 在 heroku 上工作，请发表评论，我将 post 在另一个 post 中提供更多详细信息

这不是一件小事。但我会给出一些指示。

您需要一个 app.json 来定义您的构建包。
https://devcenter.heroku.com/articles/app-json-schema#buildpacks

如果this project is available via apt it's going to be easy. You just use the Heroku's Apt buildpack define an Aptfile that says which packages it needs to install. Example
然后它会自动安装它，你就完成了。

如果它不能作为包提供，您将需要创建自己的构建包。
https://devcenter.heroku.com/articles/buildpack-api
Example used here.

另一种解决方案是 docker 化您的项目并将其作为 docker 容器执行。

在不丢失任何格式的情况下将 PDF 转换为 HTML

Convert PDF to HTML without losing any format

html

python

pdf

heroku

pdf2htmlex