Getting h1 from markdown via python's pandoc library

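I'm writing a Python batch script that processes a lot of Markdown files and pulls out the h1-like text, so I can generate a 'title' metadata variable (I forgot to add 'title' to the frontmatter). I'm not using it as a pandoc filter.

So I'd like to process these files with pandoc-python, but I'm not familiar with it and I don't know how to get only the h1.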

content = pandoc.read(post.content)

'content' is in pandoc's native format. I see something like this:

(Pdb) content
Pandoc(Meta({}), [Header(1, ('foobar', [], []), [Str('foobar:')]), Para(...

I want to get the h1 as plain text.

One could also try configuring pandoc to do this for us. Here is what the manual says about the --shift-heading-level-by option:

--shift-heading-level-by=NUMBER

Shift heading levels by a positive or negative integer. For example, with --shift-heading-level-by=-1, level 2 headings become level 1 headings, and level 3 headings become level 2 headings. Headings cannot have a level less than 1, so a heading that would be shifted below level 1 becomes a regular paragraph. Exception: with a shift of -N, a level-N heading at the beginning of the document replaces the metadata title. --shift-heading-level-by=-1 is a good choice when converting HTML or Markdown documents that use an initial level-1 heading for the document title and level-2+ headings for sections. --shift-heading-level-by=1 may be a good choice for converting Markdown documents that use level-1 headings for sections to HTML, since pandoc uses a level-1 heading to render the document title.

So running pandoc with --shift-heading-level-by=-1 might be enough for your needs.
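A minimal sketch of that idea, assuming the pandoc executable is on your PATH: ask pandoc for its JSON AST with the shift applied, then read the promoted title out of the metadata. The helper names (inlines_to_text, title_from_ast, shifted_title) are made up for illustration, and the flattening only covers the most common inline types.

```python
import json
import subprocess


def inlines_to_text(inlines):
    """Flatten pandoc JSON inlines to plain text (common inline types only)."""
    parts = []
    for node in inlines:
        kind, value = node["t"], node.get("c")
        if kind == "Str":
            parts.append(value)
        elif kind in ("Space", "SoftBreak", "LineBreak"):
            parts.append(" ")
        elif kind == "Code":
            parts.append(value[1])  # Code content is [attr, text]
        elif kind in ("Emph", "Strong", "Strikeout"):
            parts.append(inlines_to_text(value))  # content is a list of inlines
        elif kind in ("Link", "Image"):
            parts.append(inlines_to_text(value[1]))  # [attr, inlines, target]
    return "".join(parts)


def title_from_ast(doc):
    """Return the metadata title of a pandoc JSON document, or None if absent."""
    title = doc.get("meta", {}).get("title")
    if title is None:
        return None
    if title["t"] == "MetaInlines":
        return inlines_to_text(title["c"])
    if title["t"] == "MetaString":
        return title["c"]
    return None


def shifted_title(path):
    """Run pandoc with the heading shift and return the resulting title."""
    result = subprocess.run(
        ["pandoc", "--from=markdown", "--to=json",
         "--shift-heading-level-by=-1", path],
        check=True, capture_output=True, text=True,
    )
    return title_from_ast(json.loads(result.stdout))
```

Calling shifted_title('README.md') would then give the title as a string, or None. Per the manual text quoted above, the promotion only happens when the level-1 heading sits at the beginning of the document.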

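I have the following snippet, which works for headers written with # as well as with =======:

import pandoc
from pandoc.types import Header

# Read the Markdown source; substitute your own content here.
with open('README.md') as f:
    content = pandoc.read(f.read())

headers = []
for elt in pandoc.iter(content):
    # elt[0] is the heading level; drop the level check if you want all headers.
    if isinstance(elt, Header) and elt[0] == 1:
        # elt[1] is the attribute tuple; its first item is the header's identifier.
        headers.append(elt[1][0])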

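Or, if you want the exact string, including capital letters and so on:

for elt in pandoc.iter(content):
    # elt[0] is the heading level; drop the level check if you want all headers.
    if isinstance(elt, Header) and elt[0] == 1:
        # elt[-1] holds the header's inline elements; write them back out as text.
        headers.append(pandoc.write(elt[-1]).strip())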
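For comparison, here is a rough sketch of the same traversal without the pandoc Python package, working on the JSON AST that pandoc -t json prints. The function names (walk, stringify, headings) are hypothetical, and the text extraction simply keeps Str and Code text and turns spaces and line breaks into single spaces, so treat it as an approximation.

```python
def walk(node):
    """Yield every dict node of a pandoc JSON AST, depth first."""
    if isinstance(node, dict):
        yield node
        yield from walk(node.get("c"))  # element content, if any
    elif isinstance(node, list):
        for child in node:
            yield from walk(child)


def stringify(node):
    """Collect the plain text found anywhere below the given node."""
    parts = []
    for item in walk(node):
        kind = item.get("t")
        if kind == "Str":
            parts.append(item["c"])
        elif kind in ("Space", "SoftBreak", "LineBreak"):
            parts.append(" ")
        elif kind == "Code":
            parts.append(item["c"][1])  # Code content is [attr, text]
    return "".join(parts)


def headings(doc, level=1):
    """Return the text of every header of the given level in the document."""
    return [
        stringify(node["c"][2])  # Header content is [level, attr, inlines]
        for node in walk(doc["blocks"])
        if node.get("t") == "Header" and node["c"][0] == level
    ]
```

Here doc would be the result of json.loads on pandoc's JSON output, and headings(doc) returns a list with one string per level-1 header.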