Python script for getting section contents from a Markdown document

Question

I'm trying to write a Python script that splits the content of a Markdown document into its sections. For example, given

# Section 1

Hello

# Section 2

Bla la dsds

# Section 3 #

Ssss

## Subsection ##

aaaa

I would like to get:

contents = ['# Section 1\n\nHello\n', '# Section 2\n\nBla la dsds\n', '# Section 3 #\n\nSsss\n\n## Subsection ##\n\naaaa']

How can I do this?

Answer 1

I don't know Markdown very well, but I'll give it a try.

A Markdown file is just a plain-text file, so you can load it like this:

with open('markdownfile.md', 'r') as file:  # the file is closed automatically
    data = file.read()

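It looks like the common delimiter you want to split on is "\n#", but only when it is not followed by another "#", i.e. when it is not "\n##".

So one approach I can see is to split the file on "\n#" and then repair the subsections: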

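splitData = data.split("\n#")
# Walk backwards so pop() does not shift pieces we still have to visit.
# Index 0 is skipped: it is the text before the first "\n#" and is already intact.
for i in range(len(splitData) - 1, 0, -1):
    if splitData[i].startswith('#'):  # subsection: the split removed "\n#" from "\n##"
        splitData[i-1] += '\n#' + splitData.pop(i)  # glue it back onto the previous piece
    else:  # top-level section
        splitData[i] = '#' + splitData[i]  # restore the '#' removed by .split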

Or you could iterate over the characters and do the split manually:

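contents = []
# Scan backwards for a newline followed by exactly one '#'.
for i in range(len(data) - 2, -1, -1):
    if data[i:i+2] == '\n#' and data[i:i+3] != '\n##':
        contents.append(data[i+1:])  # append the section
        data = data[:i]  # remove it from data
if data:
    contents.append(data)  # whatever is left is the first section
contents.reverse()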

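Hope this helps.

Edit: You can't just split data on "\n# " (with a trailing space), because (from my research) some Markdown implementations do not require the space for the line to be recognized as a section header (e.g. #Section 1 is still valid).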
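The rule described above (split at a newline followed by exactly one "#", with or without a space after it) can also be expressed as a regular expression. The following is only a minimal sketch, not part of the original answer; the function name split_sections is made up for illustration, and it assumes the document contains no "#" lines inside code blocks.

```python
import re

def split_sections(text):
    """Split Markdown text into top-level sections (hypothetical helper).

    Assumption: no line starting with '#' appears inside a code block.
    """
    # Consume the newline that precedes a heading, but only when the '#'
    # is not followed by a second '#', so '##' subsections stay attached.
    return re.split(r'\n(?=#(?!#))', text)


if __name__ == '__main__':
    sample = "# Section 1\n\nHello\n\n# Section 2\n\nBla la dsds"
    for section in split_sections(sample):
        print(repr(section))
```

Because the lookahead does not consume the "#", each piece keeps its own heading marker and nothing has to be added back after the split.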

Answer 2

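def get_sections(s):
    # Split on a newline followed by a top-level heading marker ("# ").
    for sec in s.split('\n# '):
        # Every piece except possibly the first lost its "# " prefix; restore it.
        yield sec if sec.startswith('# ') else '# '+sec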

contents = """# Section 1

Hello

# Section 2

Bla la dsds

# Section 3 #

Ssss

## Subsection ##

aaaa"""

for i,sec in enumerate(get_sections(contents)):
    print(i,sec)
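One limitation of splitting on '\n# ' is that a line starting with "# " inside a fenced code block (a Python comment, for instance) would also be treated as a heading. A line-by-line scan can avoid that. The sketch below is an assumption-laden illustration rather than part of the original answer: iter_sections is a hypothetical name, only triple-backtick fences are recognized, and the blank lines before the next heading stay attached to the preceding section.

```python
FENCE = '`' * 3  # triple-backtick fence marker

def iter_sections(text):
    """Yield top-level sections, ignoring '#' lines inside fenced code (hypothetical helper)."""
    current = []
    in_fence = False
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced block
        is_h1 = line.startswith('#') and not line.startswith('##')
        if is_h1 and not in_fence and current:
            yield ''.join(current)  # close the section collected so far
            current = []
        current.append(line)
    if current:
        yield ''.join(current)  # last section (or text without any heading)


if __name__ == '__main__':
    doc = "# A\n\n" + FENCE + "\n# just a comment\n" + FENCE + "\n\n# B\ntext"
    for i, sec in enumerate(iter_sections(doc)):
        print(i, repr(sec))
```

Any text that appears before the first heading is yielded as its own piece.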
