bash text parsing with multiple nested conditions

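I have the following code, which checks for lines with more than 10 words and splits them at the first occurrence of a comma. It repeats this process, so every newly split line that has more than 10 words and a comma is split as well (in the end, no line has both more than 10 words and a comma).

How can I edit this code to do the following: once all the comma splitting is done (which the current code already does), check whether the resulting lines still have more than 10 words, and if so split them at the first occurrence of "and " (with a trailing space)?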

#!/usr/bin/env bash

input=input.txt
temp=$(mktemp "${input}.XXXX")
trap 'rm -f "$temp"' 0

while awk '
  BEGIN { retval=1 }
  NF >= 10 && /, / {
    sub(/, /, ","ORS)
    retval=0
  }
  1
  END { exit retval }
' "$input" > "$temp"; do
  mv -v "$temp" "$input"
done
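
For readers unfamiliar with the idiom, the shell loop above is driven by awk's exit status: retval stays 1 unless some line was split in the current pass, and while only repeats on status 0. A minimal sketch of just that check, without the rewriting (needs_split is a hypothetical name, not part of the script above):

```shell
# Hypothetical helper: exits 0 if some input line has 10 or more fields and
# contains ", ", and exits 1 otherwise (the same test the loop relies on).
needs_split() {
    awk '
      BEGIN { retval = 1 }
      NF >= 10 && /, / { retval = 0 }
      END { exit retval }
    '
}

if printf '%s\n' 'w1 w2 w3 w4, w5 w6 w7 w8 w9 w10 w11' | needs_split; then
    echo "another pass is needed"
fi
```

Once no line qualifies any more, awk exits with 1 and the while loop stops.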

Sample input:

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10 Word11

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10, Word11 Word12 Word13 Word14 Word15 Word16 

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10 Word11 and Word12 Word13 Word14 Word15 

Word1 Word2 Word3 Word4 and Word5

Expected output:

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9

Word1 Word2 Word3 Word4, 
Word5 Word6 Word7 Word8 Word9 Word10 Word11

Word1 Word2 Word3 Word4,
 Word5 Word6 Word7 Word8 Word9 Word10,
 Word11 Word12 Word13 Word14 Word15 Word16 

Word1 Word2 Word3 Word4, 
Word5 Word6 Word7 Word8 Word9 Word10 Word11 and
 Word12 Word13 Word14 Word15 

Word1 Word2 Word3 Word4 and Word5

Thanks in advance!

Is this the answer you expect?

echo "Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10, Word11 Word12 Word13 Word14 Word15 Word16 Word17 Word18 Word19 Word20 Word21 and Word22 Word23 Word24." | grep -oE '[a-zA-Z0-9,.]+' | awk '
BEGIN {
    cnt = 0
}
{
    str = str " " $0
    if ($0 ~ /,$/){
        print str
        cnt = 0
        str = ""
    }
    else if (cnt < 10){
        cnt++
    }
    else {
        print str
        cnt = 0
        str = ""
    }
} END {
    print str
}' | sed 's/^ *//'
Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10,
Word11 Word12 Word13 Word14 Word15 Word16 Word17 Word18 Word19 Word20 Word21
and Word22 Word23 Word24.

Please try the following:

awk '{
    while (split($0, a, "( +and +)|( +)") > 10 && match($0, "( +and +)|,")) {
        if (match($0, "[^,]+,")) {
            # puts a newline after the 1st comma
            print substr($0, 1, RLENGTH)
            $0 = substr($0, RLENGTH + 1)
        } else {
            # puts a newline before the 1st substring " and "
            n = split($0, a, " +and +")
            if (a[1] == "") {               # $0 starts with " and "
                a[1] = " and " a[2]
                for (i = 2; i < n; i++) {
                    a[i] = a[i+1]
                }
                n--
            }
            if (n < 2) break                # no further " and " to split at, so stop instead of looping forever
            print a[1]
            $0 = " and " a[2]
            for (i = 3; i <= n; i++) {      # there are two or more " and "
                $0 = $0 " and " a[i]
            }
        }
    }
    print
}' input.txt

Output for the given input:

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9

Word1 Word2 Word3 Word4,
 Word5 Word6 Word7 Word8 Word9 Word10 Word11

Word1 Word2 Word3 Word4,
 Word5 Word6 Word7 Word8 Word9 Word10,
 Word11 Word12 Word13 Word14 Word15 Word16

Word1 Word2 Word3 Word4,
 Word5 Word6 Word7 Word8 Word9 Word10 Word11
 and Word12 Word13 Word14 Word15

Word1 Word2 Word3 Word4 and Word5

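[Explanation]

  • It keeps iterating over the same record while the pattern space contains more than 10 fields (not counting the word "and") && the pattern space still contains a separator (a comma or " and ") at which it can be split.
  • If the pattern space contains a comma, it prints the left-hand part and updates the pattern space with the right-hand part.
  • If the pattern space contains the word "and", the processing is a bit trickier, because the word stays in the updated pattern space. My approach may not be elegant, but it works even when the record contains multiple (two or more) " and "s.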
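
To make the first point concrete, here is a small sketch that prints the count returned by split() with the "( +and +)|( +)" separator next to the plain field count NF. The helper name count_words and the sample line are made up for illustration:

```shell
# Hypothetical helper: for each line on stdin, print the split()-based word
# count (which skips "and") followed by awk's ordinary field count NF.
count_words() {
    awk '{
        n = split($0, a, "( +and +)|( +)")   # " and " is consumed as a single separator
        print n, NF
    }'
}

printf '%s\n' 'Word1 Word2 and Word3 Word4' | count_words
```

For this line it should print 4 5: four words when "and" is treated as a separator, five fields when it is counted like any other word.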

[Edit]

If you want to count the word "and" as part of the word count, replace the 2nd line:

while (split($0, a, "( +and +)|( +)") > 10 && match($0, "( +and +)|,")) {

with:

while (NF > 10 && match($0, "( +and +)|,")) {
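
The NF test works inside the loop because assigning to $0 makes awk re-split the record, so NF always reflects the text that is still left. A minimal sketch of that behaviour (the function name nf_before_after and the sample line are made up for illustration):

```shell
# Hypothetical helper: print NF for the whole line, then again after $0 has
# been cut down to the text that follows the first comma.
nf_before_after() {
    awk '{
        print NF                               # fields in the full record
        $0 = substr($0, index($0, ",") + 1)    # keep only the text after the comma
        print NF                               # NF was recomputed for the shorter record
    }'
}

printf '%s\n' 'a b c, d e' | nf_before_after
```

Here the first number is 5 and the second is 2, since only " d e" remains after the assignment.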

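In addition, if you allow the word "and" to stay at the end of the original line instead of starting the next one, the script can be slightly simplified to: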

awk '{
    while (NF > 10 && match($0, "( +and +)|,")) {
        if (match($0, "[^,]+,")) {
            # puts a newline after the 1st comma
            print substr($0, 1, RLENGTH)
            $0 = substr($0, RLENGTH + 1)
        } else {
            # puts a newline after the 1st substring " and "
            n = split($0, a, " +and +")
            print a[1] " and"
            $0 = " " a[2]
            for (i = 3; i <= n; i++) {      # there are two or more " and "
                $0 = $0 " and " a[i]
            }
        }
    }
    print
}' input.txt

Also, if Perl is an option for you, you could say:

perl -ne '{
    while (split > 10 && /( +and +)|,/) {
        if (/^.*?(, *| +and +)/) {
            print $&, "\n";
            $_ = " $'\''";
        }
    }
    print
}' input.txt

Hope this helps.