Split Markdown text file by regular expression that defines headings

Question

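I am trying to split a larger text file into chunks using a command-line program, with:

splitting on a defined regex pattern
file names defined by a capture group in that regex pattern

The text file is formatted as: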

# Title

# 2020-01-01

Multi-line content
goes here

# 2020-01-02

Other multi-line content
goes here

The output should be these two files, with the following file names and contents:

2020-01-01.md↓

# 2020-01-01

Multi-line content
goes here

2020-01-02.md↓

# 2020-01-02

Other multi-line content
goes here

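I can't seem to get all the criteria right.

The regex pattern for splitting (the delimiter) is simple enough, something like ^# (2020-.*)$

Either I can't set up a multi-line regex pattern that spans the \n newlines and stops at the next occurrence of the delimiter pattern.

Or I can split on the regex pattern with csplit, but then I can't name the files after what was captured in (2020-.*)

It's the same with awk split() or match(): I can't get them to work fully.

I'm looking for a generic solution whose parameters are the patterns that define the beginning of a chunk (e.g. # 2020-01-01) and its end (e.g. the next date heading # 2020-01-02, or EOF).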

Answer 1

Using this regex, here is a Perl one-liner that does it:

perl -0777 -nE 'while (/^\h*#\h*(2020.*)([\s\S]*?(?:(?=(^\h*#\h*2020.*))|\z))/gm) {
    open($fh, ">", $1.".md") or die $!;
    print $fh $1;
    print $fh $2;
    close $fh;
}' file

Result (the leading # sits outside the capture group, so each file starts with the bare date):

head 2020*
==> 2020-01-01.md <==
2020-01-01

Multi-line content
goes here


==> 2020-01-02.md <==
2020-01-02

Other multi-line content
goes here

Answer 2

Using any awk in any shell on every Unix box:

$ awk '/^# [0-9]/{ close(out); out=$2".md" } out!=""{print > out}' file

$ head *.md
==> 2020-01-01.md <==
# 2020-01-01

Multi-line content
goes here


==> 2020-01-02.md <==
# 2020-01-02

Other multi-line content
goes here

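If /^# [0-9]/ isn't an adequate regexp, then change it to whatever you like, e.g. /^# [0-9]{4}(-[0-9]{2}){2}$/ would be more restrictive. FWIW, though, if you hadn't asked for it I wouldn't use a regexp for this at all. I'd use:

awk '($1=="#") && (c++){ close(out); out=$2".md" } out!=""{print > out}' file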
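
Since the question asks for something generic, here is a minimal sketch of a parameterized variant of the awk commands above. The function name split_by_heading and its two arguments are hypothetical, chosen only for illustration, and it assumes the file name is always the second whitespace-separated field of the matching heading line.

```shell
# split_by_heading FILE [REGEX]
# Hypothetical helper (the name and arguments are assumptions, not an existing tool).
# FILE  : Markdown file to split
# REGEX : awk regexp that marks the first line of a chunk (default: "# " followed by a digit)
# A chunk runs until the next matching line or EOF; lines before the first match are dropped.
split_by_heading() {
    # Note: awk processes backslash escapes in -v assignments,
    # so a backslash in REGEX has to be doubled.
    awk -v re="${2:-^# [0-9]}" '
        $0 ~ re {
            if (out != "") close(out)   # finish the previous chunk
            out = $2 ".md"              # file name = second field of the heading
        }
        out != "" { print > out }       # the heading line itself is written too
    ' "$1"
}
```

It would be called as split_by_heading file, or with an explicit pattern as in split_by_heading file '^# 2020-'. If the same heading occurs twice, the later chunk overwrites the earlier file, because the file is reopened with > after close().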
