How can I convert a Markdown string to a DocX in Python?

Question

I am fetching Markdown text from my API, like this:

{
    name:'Onur',
    surname:'Gule',
    biography:'## Computers
    I like **computers** so much.
    I wanna *be* a computer.',
    membership:1
}

The biography column contains a Markdown string, as shown above.

## Computers
I like **computers** so much.
I wanna *be* a computer.

I want to convert this Markdown text into docx content for my reports.

In my docx template:

{{markdownText|mark2html}}

{{simpleText}}

I am using the python3 docxtpl package to create the docx, and it works for simple text.

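I tried BeautifulSoup to convert the Markdown into docx text, but it does not carry over styling (bold, italic, etc.).
I tried pandoc and it works, but it only creates a standalone docx file; I want to add the rendered Markdown text to an existing docx (at creation time).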

My current code:

import docx
from docxtpl import DocxTemplate, RichText
import markdown
import jinja2
import markupsafe
from bs4 import BeautifulSoup
import pypandoc

def safe_markdown(text):
    return markupsafe.Markup(markdown.markdown(text))

def mark2html(value):
    html = markdown.markdown(value)
    soup = BeautifulSoup(html, features='html.parser')
    output = pypandoc.convert_text(value,'rtf',format='md')
    return RichText(value) #tried soup and pandoc..

def from_template(template):
    template = DocxTemplate(template)
    context = {
        'simpleText':'Simple text test.',
        'markdownText':'Markdown **text** test.'
    } 
    jenv = jinja2.Environment()
    jenv.filters['markdown'] = safe_markdown
    jenv.filters["mark2html"] = mark2html
    template.render(context,jenv)
    template.save('new_report.docx')

So, how can I add rendered Markdown to an existing docx, or add it at creation time, perhaps with a jinja2 filter?

Answer 1

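I followed a lazy, not very efficient, but workable strategy. Since working with docx is less flexible than working with HTML, I first convert the Markdown to HTML and then go from HTML to docx, like this: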

from jinja2 import FileSystemLoader, Environment
from pypandoc import convert_file, convert_text

def md2html(md):
  return convert_text(md, 'html', format='md')

def html2docx(file):
  return convert_file(f'{file}.html', 'docx', format='html', outputfile=f'{file}.docx')

def from_template(template_file, f_out):
  context = {
      'simpleText': 'Simple text test.',
      'markdownText': 'Markdown **text** test.'
  }
  ldr = FileSystemLoader(searchpath='./')
  jenv = Environment(loader=ldr)
  jenv.filters["md2html"] = md2html
  template = jenv.get_template(template_file)
  html = template.render(context)
  print(html)
  # The with block closes the file, so no explicit close() is needed.
  with open(f'{f_out}.html', 'w') as fout:
    fout.write(html)
  html2docx(f_out)

if __name__ == '__main__':
  from_template('template.html.jinja', 'new_report')

As for the template's contents, it should be an HTML-type template, like this:

<!DOCTYPE html>
<html xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
  <head></head>
  <body>
    {{markdownText|md2html}}
    {{simpleText}}
  </body>
</html>

I saved it as template.html.jinja.

I would love to look into @Mahrkeenerh's contribution; the API mentioned there seems to have a lot to learn and understand.

Answer 2

I solved it without any shortcuts. I convert the Markdown to HTML, parse it with BeautifulSoup, and then process each paragraph by checking its tag name.

In my Word template:

{% if markdownText != None %}
    {% for mt in markdownText|mark2html %} 
        {{mt}}
    {% endfor %}
{% endif %}

My template filter:

def mark2html(value):
    if value is None:
        return '-'
    html = markdown.markdown(value)
    soup = BeautifulSoup(html, features='html.parser')
    paragraphs = []
    global doc
    # Visit every tag, but only convert paragraphs and headings.
    for tag in soup.find_all(True):
        if tag.name in ('p','h1','h2','h3','h4','h5','h6'):
            paragraphs.extend(parseHtmlToDoc(tag))
    return paragraphs
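
The block-level step of this filter (keep only paragraphs and headings, chosen by tag name) can be illustrated without any third-party packages. This is only a sketch: it swaps BeautifulSoup for the standard library's html.parser, and BlockCollector, BLOCK_TAGS and collect_blocks are hypothetical names that do not appear in the answer's code.

```python
from html.parser import HTMLParser

# Block-level tags the filter above looks for.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockCollector(HTMLParser):
    """Collect (tag_name, text) pairs for paragraphs and headings."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished (tag_name, text) pairs
        self._current = None  # name of the block tag we are inside, if any
        self._text = []       # text pieces seen inside the current block

    def handle_starttag(self, tag, attrs):
        # Open a new block only when we are not already inside one.
        if tag in BLOCK_TAGS and self._current is None:
            self._current = tag
            self._text = []

    def handle_data(self, data):
        # Text outside any block (for example newlines between tags) is ignored.
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.blocks.append((tag, "".join(self._text)))
            self._current = None


def collect_blocks(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling collect_blocks on the HTML rendered from the biography field would give one entry per heading or paragraph, which is the granularity at which the filter hands tags to parseHtmlToDoc. Inline styling is flattened here on purpose.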

My code for inserting into the docx:

from bs4.element import Tag
from docxtpl import InlineImage, RichText

def parseHtmlToDoc(org_tag):
    # `doc` and `settings` are defined elsewhere in the project (not shown here).
    pars = []
    for con in org_tag.contents:
        # True when the previous item is a RichText that more runs can be added to.
        can_extend = len(pars) > 0 and isinstance(pars[-1], RichText)
        if isinstance(con, Tag):
            tag = con
            if tag.name in ('strong', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'):
                if can_extend:
                    pars[-1].add(tag.contents[0], bold=True)
                else:
                    source = RichText("")
                    source.add(tag.contents[0], bold=True)
                    pars.append(source)
            elif tag.name == 'img':
                # Images become InlineImage objects resolved against MEDIA_ROOT.
                imagen = InlineImage(doc, settings.MEDIA_ROOT + tag['src'])
                pars.append(imagen)
            elif tag.name == 'em':
                source = RichText("")
                source.add(tag.contents[0], italic=True)
                pars.append(source)
        else:
            # Plain text node.
            if can_extend:
                # Continue the existing RichText with unstyled text.
                pars[-1].add(con)
            else:
                source = RichText("")
                if org_tag.name == 'h2':
                    source.add(con, bold=True, size=40)
                else:
                    source.add(con)
                pars.append(source)  # always append?
    return pars

It handles HTML tags for bold, italic, images, and headings. You can add more tags to handle. I solved it this way, and it needs no extra file conversion such as html2docx.
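
One way to make "add more tags" cheap is to drive the inline handling from a lookup table instead of an if/elif chain. The sketch below is an illustration, not the answer's code: it uses the standard library's html.parser rather than BeautifulSoup, it returns plain (text, style) pairs rather than RichText objects, and INLINE_STYLES, RunCollector and html_to_runs are hypothetical names.

```python
from html.parser import HTMLParser

# Map each inline tag to the style keywords it switches on.
# Supporting a new tag means adding one entry here.
INLINE_STYLES = {
    "strong": {"bold": True},
    "b": {"bold": True},
    "em": {"italic": True},
    "i": {"italic": True},
}


class RunCollector(HTMLParser):
    """Split HTML into (text, style) runs, honouring nested inline tags."""

    def __init__(self):
        super().__init__()
        self.runs = []   # (text, style_dict) pairs in document order
        self._open = []  # stack of currently open inline tags

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_STYLES:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Remove the most recently opened tag with this name.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        # Merge the styles of every inline tag that is currently open.
        style = {}
        for tag in self._open:
            style.update(INLINE_STYLES[tag])
        self.runs.append((data, style))


def html_to_runs(html):
    parser = RunCollector()
    parser.feed(html)
    parser.close()
    return parser.runs
```

Each pair could then be passed to a RichText with something like rt.add(text, **style), since bold and italic are the same keyword arguments the code above already passes to RichText.add. Unlike the original loop, this version also handles nesting such as bold inside italic.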
