Using awk sub to add a numeric prefix to a string, repeating each number for up to 5 matches, in a text file with multiple matches per line

Question

Suppose I have the file below. I want to add a numeric prefix to .dog, with each number repeated for 5 occurrences before it is incremented:

[.dog]
-house
.cat
.dog
foo.dogfish
[.dog]
-house
-house
.cat
foo.dogfish
.cat
.dog
[.dog]
-house

  [ -kitchen cat.dog 45_house-dog_.dogfish ]
 
    house_dogfish_cat

    'cat_.dog' -kitchen '
    :.;"house.cat()";
     food' today.cat


  [ -kitchen cat.dog ]
 
    house_dogfish_cat

    'cat_.dog' -kitchen '
    :.;"house.cat()";
     food' today.cat
 

  [ -kitchen cat.dog ]
 
    house_dogfish_cat

    'cat_.dog' -kitchen '
    :.;"house.cat()";
     food' today.cat

There is no case where .dog should be left unchanged: every .dog should become <number>.dog, and foo.dogfish should likewise become foo<number>.dogfish. So my desired output is:

[1.dog]
-house
.cat
1.dog
foo1.dogfish
[1.dog]
-house
-house
.cat
foo1.dogfish
.cat
2.dog
[2.dog]
-house

  [ -kitchen cat2.dog 45_house-dog_2.dogfish ]
 
    house_dogfish_cat

    'cat_2.dog' -kitchen '
    :.;"house.cat()";
     food' today.cat


  [ -kitchen cat3.dog ]
 
    house_dogfish_cat

    'cat_3.dog' -kitchen '
    :.;"house.cat()";
     food' today.cat
 

  [ -kitchen cat3.dog ]
 
    house_dogfish_cat

    'cat_3.dog' -kitchen '
    :.;"house.cat()";
     food' today.cat

Edit, update 1: In particular, a line such as [ -kitchen cat.dog 45_house-dog_.dogfish ] should be changed to [ -kitchen cat<number>.dog 45_house-dog_<number>.dogfish ]. I think a solution that avoids this might use something like BEGIN{IGNORECASE =1 }/*not-match/.

I have this code from user Cyrus:

 awk 'BEGIN{ count=1 } /\.dog/{ t=count; sub(/\..*/,"",t); sub(".dog", t "&"); count+=.2 }1' file

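The only problem is that this code changes [ -kitchen cat.dog 45_house-dog_.dogfish ] to [ -kitchen cat2.dog 45_house-dog_.dogfish ] instead of [ -kitchen cat2.dog 45_house-dog_2.dogfish ]. To summarize: lines where .dog occurs once get the correct prefix, but on lines where .dog occurs more than once, only the first occurrence gets the numeric prefix.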
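The behaviour described above can be reproduced in isolation. This is a minimal sketch, assuming a fixed prefix of 2 purely for illustration (the variable names are hypothetical): sub() rewrites only the first match on a line, while gsub() rewrites every match but can only apply one prefix to the whole line, so neither can advance the counter partway through a line.

```shell
line='[ -kitchen cat.dog 45_house-dog_.dogfish ]'

# sub() stops after the first match on the line
first_only=$(printf '%s\n' "$line" | awk '{ sub(/\.dog/, "2&") } 1')

# gsub() rewrites every match, but with one fixed prefix for the whole line
every_match=$(printf '%s\n' "$line" | awk '{ gsub(/\.dog/, "2&") } 1')

printf '%s\n' "$first_only" "$every_match"
```

The first line printed still contains an unprefixed .dogfish; the second has both occurrences prefixed.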

Answer 1

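Assumptions:

for each occurrence of the string .dog, prefix said string with an integer (pfx)
said integer (pfx) starts at 1 and is incremented by +1 each time it has been used n=5 times

One awk idea: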

awk -v n=5 '
{ newline=""
  while ( x=index($0,".dog") ) {
        if (cnt++ % n == 0) pfx++                              # increment our prefix? cnt == number of times we have used pfx
        newline=newline substr($0,1,x-1) pfx substr($0,x,4)    # append pfx to this occurrence of ".dog"
        $0=substr($0,x+4)                                      # reset $0 to rest of line
  }
  print newline $0                                             # print newline plus anything left in $0
}
' dog.dat

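Note: the 4 (in x,4 and x+4) is the length of the search string .dog; if OP wants to search for a different string, the 4's need to be updated accordingly (e.g., if searching for .dogs, change both 4's to 5's).

This generates: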

[1.dog]
-house
.cat
1.dog
foo1.dogfish
[1.dog]
-house
-house
.cat
foo1.dogfish
.cat
2.dog
[2.dog]
-house

  [ -kitchen cat2.dog 45_house-dog_2.dogfish ]

    house_dogfish_cat

    'cat_2.dog' -kitchen '
    :.;"house.cat()";
     food' today.cat


  [ -kitchen cat3.dog ]

    house_dogfish_cat

    'cat_3.dog' -kitchen '
    :.;"house.cat()";
     food' today.cat


  [ -kitchen cat3.dog ]

    house_dogfish_cat

    'cat_3.dog' -kitchen '
    :.;"house.cat()";
     food' today.cat

FWIW, with n=3 and a single line of input = ".dog .dog .dog .dog .dog .dog .dog .dog .dog .dog" this generates:

1.dog 1.dog 1.dog 2.dog 2.dog 2.dog 3.dog 3.dog 3.dog 4.dog
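The hard-coded 4 mentioned in the note can be avoided by passing the search string as a variable and taking its length. The following is only a sketch of that idea, not part of the original answer: the function name prefix_matches and the variables str, len and out are hypothetical. Because the string is passed with -v, awk interprets backslash escapes in it, so this assumes a search string without backslashes.

```shell
# Usage: prefix_matches STRING N   (reads text on stdin)
prefix_matches() {
    awk -v str="$1" -v n="$2" '
    BEGIN { len = length(str) }
    {
        out = ""
        # index() does a literal search, so the dot in ".dog" is not a wildcard
        while (len > 0 && (x = index($0, str)) > 0) {
            if (cnt++ % n == 0) pfx++               # move to the next prefix after n uses
            out = out substr($0, 1, x - 1) pfx str  # text before the match, prefix, match
            $0 = substr($0, x + len)                # keep scanning the rest of the line
        }
        print out $0                                # rebuilt part plus any unmatched tail
    }'
}

printf '.dog .dog .dog .dog .dog .dog .dog .dog .dog .dog\n' | prefix_matches .dog 3
```

Called as prefix_matches .dog 5 < dog.dat it should behave like the command above; searching for .dogs only requires changing the first argument.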

Answer 2

You can do this:

awk -v RS='\.dog' -v NR=4 '{ORS = int(NR/5)".dog"; print}'

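This works, except for an extra trailing N.dog (at the very end of the file).

So you can fix the trailing N.dog with this version (or is there a better way? Edit: a better way has been added at the end):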

awk -v RS='\.dog' \
'{
    lines[NR]=$0 int((NR+4)/5)".dog"
}

END {
    ORS = ""

    for(i=1; i<NR; i++) {
        print lines[i]
    }

    print $0
}'

Explanation: use the target string (.dog) as the record separator, count the records, and print count/5 between each record and the record separator.
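The count/5 arithmetic can be checked on its own. A small sketch, assuming 11 records and using k as a hypothetical stand-in for the record number: int((k+4)/5) gives the prefix printed after record k, and the variants that start with -v NR=4 and use int(NR/5) compute the same value, because NR is then k+4.

```shell
# Prefix for records 1..11 when each number is used 5 times
prefixes=$(awk 'BEGIN {
    for (k = 1; k <= 11; k++)
        printf "%d%s", int((k + 4) / 5), (k < 11 ? " " : "\n")
}')
printf '%s\n' "$prefixes"
```

This prints 1 five times, then 2 five times, then 3.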

Note: POSIX 2018 states:

If RS contains more than one character, the results are unspecified.

However, various awks do implement regular expressions for RS. It is documented in mawk and gawk. Both examples above were tested in mawk, gawk and busybox awk.

Edit, better solution: based on the comments, here is a complete solution that does not copy the input file into memory and does not print the extra N.dog:

awk -v RS='\.dog' -v NR=4 \
'(NR != 5) {print line}
{ORS = int(NR/5)".dog"; line=$0}
END {ORS = ""; print}'

Or, more readable (equivalent):

awk -v RS='\.dog' -v NR=4 \
'{
    if (NR != 5) {
        print line
    }

    ORS = int(NR/5)".dog"
    line=$0
}

END {
    ORS = ""
    print
}'
