How to merge similar lines into one, if they match a string?

Question

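I want to edit the file given below. The keys are the leading (0)/(1)/(2) etc. and the trailing [STRING].

If any two lines start with the same number and contain [STRING], only the first line should be kept; the others should be deleted, and a comment " --- total_number_of_lines" should be appended to the end of the first line, e.g. --- 2 or --- 3 or --- 4.

Please refer to the example below.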

(0) some text let it be [STRING]
(0) some text1 let it be
(1) some text2 let it be 
(1) some text3 let it be [STRING]
(1) some text4 let it be [STRING]
(1) some text5 let it be [STRING]
(1) some text6 let it be [STRING]
(1) some text7 let it be [XYZ]
(0) some text8 let it be [STRING]
(0) some text9 let it be
(1) some text10 let it be 
(1) some text11 let it be [STRING]
(1) some text12 let it be [STRING]
(2) some text13 let it be [STRING]
(2) some text14 let it be [STRING]
(2) some text15 let it be [STRING]
(3) some text16 let it be [ABC]
(3) some text17 let it be [STRING]
(3) some text18 let it be [STRING]
(1) some text19 let it be [STRING]
(1) some text20 let it be [STRING]
(1) some text21 let it be [STRING]
(1) some text22 let it be [STRING]
(1) some text23 let it be [DEF]

This needs to be edited to:

(0) some text let it be [STRING]
(0) some text1 let it be
(1) some text2 let it be 
(1) some text3 let it be [STRING] --- 4
(1) some text7 let it be [XYZ]
(0) some text8 let it be [STRING]
(0) some text9 let it be
(1) some text10 let it be 
(1) some text11 let it be [STRING] --- 2
(2) some text13 let it be [STRING] --- 3
(3) some text16 let it be [ABC]
(3) some text17 let it be [STRING] --- 2
(1) some text19 let it be [STRING] --- 4 
(1) some text23 let it be [DEF]

Does anyone have any suggestions? I have corrected the question to state the requirement more clearly.

Answer 1

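The main trick in this problem is getting the behavior right at the start and at the end of processing. Other than that, it is just a matter of keeping a counter and comparing against the previous line. Here is how to do it in Tcl: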

# How to write out a line with an optional count; customise as necessary
proc writeLine {count line} {
     if {$count > 1} {
         puts "$line --- $count"
     } else {
         puts $line
     }
}

# Note that the prev variable is not set at this point

set count 0
while {[gets stdin line] >= 0} {
    # Extract the parts we care about
    if {[regexp {^(\(\d+\)).*(\[[^][]+\])$} $line -> a b]} {
        set AB $a$b
        if {[info exist prev] && $prevAB ne $AB} {
            writeLine $count $prev
            set count 0
            set prev $line
        } elseif {![info exist prev]} {
            set prev $line
        }
        set prevAB $AB
        incr count
    } else {
        # Unmatched line; flush and print
        if {[info exist prev]} {
            writeLine $count $prev
        }
        writeLine 1 $line
        set count 0
        unset -nocomplain prev prevAB
    }
}
# Print out the final line if necessary
if {[info exist prev]} {
    writeLine $count $prev
}

Answer 2

Edit: For the changed requirements, this approach can be modified to

awk 'function tok() { return $0 ~ /\[STRING\]/ ? $1 : "" } function reset() { lastline = $0; prev = tok(); ctr = 1 } function commit() { print lastline (ctr == 1 ? "" : " --- " ctr); reset() } NR == 1 { reset(); next } !tok() || prev != tok() { commit(); next } { ++ctr } END { commit(); }'

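The general approach is to read one line ahead before writing. Blocks, including blocks that consist of a single line, are printed after they have ended. The code works as follows: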

# Token for repetition detection: Lines that do not contain [STRING] are
# exempt, so for them we report an empty / no token.
function tok() {
  return $0 ~ /\[STRING\]/ ? $1 : ""
}

# reset counters etc. when a new block begins
function reset() {
  lastline = $0
  prev = tok()
  ctr = 1
}

# Write saved line, with counter if appropriate
function commit() {
  print lastline (ctr == 1 ? "" : " --- " ctr)
  reset()
}

# We write every block after it is over, and this includes single lines.
# So: First line, just prime the pump, do nothing else.
NR == 1 {
  reset()
  next
}

# If the new line is exempt (no token reported) or the token changed,
# print stuff, reprime pump.
!tok() || prev != tok() { 
  commit()
  next
}

# otherwise increase counter
{
  ++ctr
}

# and in the end, handle the last block.
END {
  commit()
}
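
To try the script on a small input, here is a minimal sketch that wraps it in a shell function. It assumes the current record is referenced as `$0` and the token is the first field `$1`; the function name `merge_runs` and the shortened sample lines are made up for illustration.

```shell
#!/bin/sh
# merge_runs (hypothetical name): read lines on stdin, collapse each run of
# consecutive [STRING] lines sharing the same first field into its first
# line, and append " --- N" when the run has more than one line.
merge_runs() {
    awk '
        function tok()    { return $0 ~ /\[STRING\]/ ? $1 : "" }
        function reset()  { lastline = $0; prev = tok(); ctr = 1 }
        function commit() { print lastline (ctr == 1 ? "" : " --- " ctr); reset() }
        NR == 1                 { reset(); next }
        !tok() || prev != tok() { commit(); next }
                                { ++ctr }
        END                     { commit() }
    '
}

# Shortened sample in the same shape as the question's file.
printf '%s\n' \
    '(0) a [STRING]' \
    '(0) b' \
    '(1) c [STRING]' \
    '(1) d [STRING]' \
    '(1) e [XYZ]' \
    '(1) f [STRING]' |
    merge_runs
```

On this sample only the c/d pair forms a run, so the output should be the lines a, b, `(1) c [STRING] --- 2`, e and f. The final f line is left alone because the [XYZ] line in between ends the run.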

Answer 3

This does what you asked for (If any two lines start with same number and has [STRING] in it then first line only should be kept, other should be deleted and append a comment at last of first line with "--- total_number_of_lines", as --- 2 or --- 3 or --- 4):

$ cat tst.awk
NR==FNR { if (/\[STRING\]$/) cnt[$1]++; next }
/\[STRING\]$/ {
    if (seen[$1]++) next
    else $0 = $0 " --- " cnt[$1]
}
1

$ awk -f tst.awk file file
(0) some text let it be [STRING] --- 2
(0) some text1 let it be
(1) some text2 let it be
(1) some text3 let it be [STRING] --- 10
(1) some text7 let it be [XYZ]
(0) some text9 let it be
(1) some text10 let it be
(2) some text13 let it be [STRING] --- 3
(3) some text16 let it be [ABC]
(3) some text17 let it be [STRING] --- 2
(1) some text23 let it be [DEF]

But obviously this doesn't match your expected output, because your expected output doesn't match your stated requirements.

Answer 4

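Another approach is to use the Unix "uniq" command. On macOS (BSD), it is:

uniq -c -s20

This counts identical adjacent lines, skipping the first 20 characters when comparing. It puts the count at the front of each line. You can move the count to the end:

uniq -c -s20 | sed -E 's/^ *([0-9]+) (.*)/\2 --- \1/g'

On Ubuntu (GNU sed) the flag is traditionally sed -r rather than sed -E, though recent GNU sed releases accept -E as well.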
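
To make the effect of the two steps concrete, here is a small sketch on made-up input. The function name `count_to_end` and the sample lines are illustrative only, and the skip width is passed as an argument instead of being fixed at 20.

```shell
#!/bin/sh
# count_to_end (hypothetical name): collapse adjacent lines that are equal
# after skipping the first $1 characters, then move the count that
# uniq -c prints at the front of each line to the end.
count_to_end() {
    uniq -c -s "$1" | sed -E 's/^ *([0-9]+) (.*)/\2 --- \1/'
}

# The first 4 characters, "(n) ", are skipped when comparing.
printf '%s\n' \
    '(1) foo [STRING]' \
    '(2) foo [STRING]' \
    '(3) bar [XYZ]' |
    count_to_end 4
```

This should print `(1) foo [STRING] --- 2` followed by `(3) bar [XYZ] --- 1`. Two things follow from how uniq works: the skipped prefix, including the leading number, plays no part in the comparison, and every surviving line gets a count, including runs of one.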

regex

shell

awk

sed

tcl