Split portion of string in bash

Question

I have some very long text files (> 300-500 MB) with thousands of lines, for example:

blavbl
[code]
sdasdasd
asdasd
...
[/code]

line X
line Y
etc
...

[code]
...
[/code]

blabla

[code]

[/code]

I want to split the text into pieces that contain the strings between [code] and [/code]. The following code does the job (partially), but it is very slow:

#!/bin/bash

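function split {
        file="$1"    # input file
        start="$2"   # opening tag, e.g. [code]
        end="$3"     # closing tag, e.g. [/code]

        nfodata=$(cat "$file")
        IFS=$'\n' read -d '' -a nfoarray <<< "$nfodata"

        arr=()
        inside=0     # 1 while between the opening and closing tags

        for line in "${nfoarray[@]}"
        do
                # The quoted right-hand side is matched literally, not as a regex
                if [[ "$line" =~ ^"$start" ]]; then
                        arr+=("$line")
                        inside=1
                        continue
                fi

                # Stops at the first closing tag
                if [[ "$line" =~ ^"$end" ]]; then
                        inside=0
                        break
                fi

                if [[ $inside == 1 ]]; then
                        arr+=("$line")
                        continue
                fi
        done

        printf "%s\n" "${arr[@]}"
}

split "$myfile" "[code]" "[/code]"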

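As I said, it is very slow, and I don't know whether there is a better or faster way.

The end result should be an array containing the parts of the string between [code] and [/code].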

Answer 1

Using sed:

sed '/^\[code\]$/,/^\[\/code\]$/!d;//d'

Using awk:

awk  '
/^\[\/code\]$/ {--c} c
/^\[code\]$/ {++c}'

Either of these methods requires that the tags alternate cleanly: no nested, repeated, or unclosed tags.
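Since both commands depend on that precondition, it may be worth checking a file before processing it. The following is only a sketch: check_tags is a hypothetical helper name, and it assumes the tags sit alone on their lines, as in the patterns above.

```shell
#!/bin/bash
# check_tags FILE  (hypothetical helper, not part of the answer above)
# Returns 0 if every [code] line is closed by a [/code] line before the
# next [code] line appears; otherwise reports each problem and returns 1.
check_tags() {
    awk '
        /^\[code\]$/ {
            if (open) { print "line " NR ": nested or repeated [code]"; bad = 1 }
            open = 1
        }
        /^\[\/code\]$/ {
            if (!open) { print "line " NR ": [/code] without [code]"; bad = 1 }
            open = 0
        }
        END {
            if (open) { print "unclosed [code] at end of input"; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}
```

It could be used as: check_tags "$myfile" && echo "tags alternate cleanly"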

The sed and awk commands above both print every line inside the tags, excluding the tag lines themselves. For example:

sdasdasd
asdasd
...
...
<empty line>
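The question asks for an array as the end result, while the commands above print one flat stream of lines. One possible way to get one array element per block is sketched below. It is not part of the original answer: read_blocks and blocks are hypothetical names, and the sketch assumes bash 4.4 or later (for mapfile -d) and that the input never contains the ASCII record separator byte (octal 036), which is used here to mark block boundaries.

```shell
#!/bin/bash
# read_blocks FILE  (hypothetical helper)
# Fills the global array "blocks" with one element per [code]...[/code]
# block. Each element holds the lines of that block, newline-terminated.
# Assumption: FILE contains no 0x1E (record separator) byte.
read_blocks() {
    mapfile -t -d $'\x1e' blocks < <(
        awk '
            # closing tag: leave the block and emit the separator byte
            /^\[\/code\]$/ { inside = 0; printf "\036"; next }
            # any line inside a block is passed through unchanged
            inside         { print }
            # opening tag: start collecting from the next line on
            /^\[code\]$/   { inside = 1 }
        ' "$1"
    )
}
```

After read_blocks "$myfile", the number of blocks is "${#blocks[@]}" and the first block is "${blocks[0]}". The file is still scanned only once by awk, so this should stay far faster than a line-by-line loop in bash.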
