How to merge similar lines into one if they match a string?
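I want to edit the file given below. The key is the leading (0)/(1)/(2) etc. and the trailing [STRING].
If any two lines start with the same number and have [STRING] in them, then only the first line should be kept; the others should be deleted and a comment " --- total_number_of_lines" appended to the end of the first line, as --- 2 or --- 3 or --- 4.
Please refer to the example below.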
(0) some text let it be [STRING]
(0) some text1 let it be
(1) some text2 let it be
(1) some text3 let it be [STRING]
(1) some text4 let it be [STRING]
(1) some text5 let it be [STRING]
(1) some text6 let it be [STRING]
(1) some text7 let it be [XYZ]
(0) some text8 let it be [STRING]
(0) some text9 let it be
(1) some text10 let it be
(1) some text11 let it be [STRING]
(1) some text12 let it be [STRING]
(2) some text13 let it be [STRING]
(2) some text14 let it be [STRING]
(2) some text15 let it be [STRING]
(3) some text16 let it be [ABC]
(3) some text17 let it be [STRING]
(3) some text18 let it be [STRING]
(1) some text19 let it be [STRING]
(1) some text20 let it be [STRING]
(1) some text21 let it be [STRING]
(1) some text22 let it be [STRING]
(1) some text23 let it be [DEF]
This needs to be edited to:
(0) some text let it be [STRING]
(0) some text1 let it be
(1) some text2 let it be
(1) some text3 let it be [STRING] --- 4
(1) some text7 let it be [XYZ]
(0) some text8 let it be [STRING]
(0) some text9 let it be
(1) some text10 let it be
(1) some text11 let it be [STRING] --- 2
(2) some text13 let it be [STRING] --- 3
(3) some text16 let it be [ABC]
(3) some text17 let it be [STRING] --- 2
(1) some text19 let it be [STRING] --- 4
(1) some text23 let it be [DEF]
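Does anyone have any suggestions? I have corrected the question to state the requirement more clearly.
The main trick to solving this is getting the behaviour right at the start and at the end of processing. Otherwise, it is just a matter of keeping a counter and comparing each line with the previous one. Here is how to do it in Tcl: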
# How to write out a line with an optional count; customise as necessary
proc writeLine {count line} {
    if {$count > 1} {
        puts "$line --- $count"
    } else {
        puts $line
    }
}
# Note that the prev variable is not set at this point
set count 0
while {[gets stdin line] >= 0} {
    # Extract the parts we care about
    if {[regexp {^(\(\d+\)).*(\[[^][]+\])$} $line -> a b]} {
        set AB $a$b
        if {[info exists prev] && $prevAB ne $AB} {
            writeLine $count $prev
            set count 0
            set prev $line
        } elseif {![info exists prev]} {
            set prev $line
        }
        set prevAB $AB
        incr count
    } else {
        # Unmatched line; flush and print
        if {[info exists prev]} {
            writeLine $count $prev
        }
        writeLine 1 $line
        set count 0
        unset -nocomplain prev prevAB
    }
}
# Print out the final line if necessary
if {[info exists prev]} {
    writeLine $count $prev
}
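For comparison, the same "count consecutive lines with the same key" idea can be sketched in Python. This is only a rough sketch under assumptions: the names `KEY_RE` and `merge_runs` are made up here, and the key is taken to be the leading `(N)` plus the trailing `[TAG]`, as in the Tcl regular expression. `itertools.groupby` takes care of the start-of-input and end-of-input cases.

```python
import re
from itertools import groupby

# Assumed key pattern: a leading "(N)" and a trailing "[TAG]".
KEY_RE = re.compile(r"^(\(\d+\)).*(\[[^\[\]]+\])$")


def merge_runs(lines):
    """Collapse each run of consecutive lines that share the same (number, tag) key."""

    def key(indexed):
        index, line = indexed
        match = KEY_RE.match(line)
        # Lines without a key get a unique key, so they are never merged.
        return (match.group(1), match.group(2)) if match else ("unmatched", index)

    out = []
    for _, run in groupby(enumerate(lines), key=key):
        run = list(run)
        first = run[0][1]
        # Only annotate runs that actually merged more than one line.
        out.append(first if len(run) == 1 else f"{first} --- {len(run)}")
    return out
```

Called as `merge_runs(open("file").read().splitlines())`, it returns the merged lines as a list, which can then be printed one per line.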
Edit: With the changed requirements, this approach can be adapted to
awk 'function tok() { return $0 ~ /\[STRING\]/ ? $1 : "" } function reset() { lastline = $0; prev = tok(); ctr = 1 } function commit() { print lastline (ctr == 1 ? "" : " --- " ctr); reset() } NR == 1 { reset(); next } !tok() || prev != tok() { commit(); next } { ++ctr } END { commit(); }'
The general approach is to read one line ahead before writing. Blocks, including blocks that consist of a single line, are printed once they are over. The code works as follows:
# Token for repetition detection: lines that do not contain [STRING] are
# exempt, so for them we report an empty / no token.
function tok() {
    return $0 ~ /\[STRING\]/ ? $1 : ""
}
# reset counters etc. when a new block begins
function reset() {
    lastline = $0
    prev = tok()
    ctr = 1
}
# Write saved line, with counter if appropriate
function commit() {
    print lastline (ctr == 1 ? "" : " --- " ctr)
    reset()
}
# We write every block after it is over, and this includes single lines.
# So: First line, just prime the pump, do nothing else.
NR == 1 {
    reset()
    next
}
# If the new line is exempt (no token reported) or the token changed,
# print stuff, reprime pump.
!tok() || prev != tok() {
    commit()
    next
}
# otherwise increase counter
{
    ++ctr
}
# and in the end, handle the last block.
END {
    commit()
}
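The read-ahead pattern (hold the first line of a block, print it only when the block ends) might look like this in Python. Treat it as a sketch, not a drop-in replacement: `merge_string_runs` and its `marker` parameter are names invented for illustration, and the token is assumed to be the first whitespace-separated field of a line that contains the marker.

```python
def merge_string_runs(lines, marker="[STRING]"):
    """Merge consecutive marker lines that share the same first field."""

    def token(line):
        # Only lines containing the marker take part in merging.
        return line.split()[0] if marker in line else ""

    def flush(out, held, count):
        out.append(held if count == 1 else f"{held} --- {count}")

    out = []
    held = None      # first line of the block currently being accumulated
    held_tok = ""    # its token; "" means the block can never grow
    count = 0
    for line in lines:
        tok = token(line)
        if held is not None and tok and tok == held_tok:
            count += 1          # same block: just count the line
            continue
        if held is not None:
            flush(out, held, count)   # the block is over: write it out
        held, held_tok, count = line, tok, 1
    if held is not None:
        flush(out, held, count)       # handle the last block
    return out
```

The explicit `held is not None` checks play the role of the `NR == 1` and `END` rules in the awk version.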
This does what you say you want ("If any two lines start with the same number and have [STRING] in them, then only the first line should be kept; the others should be deleted and a comment "--- total_number_of_lines" appended to the end of the first line, as --- 2 or --- 3 or --- 4"):
$ cat tst.awk
NR==FNR { if (/\[STRING\]$/) cnt[$1]++; next }
/\[STRING\]$/ {
    if (seen[$1]++) next
    else $0 = $0 " --- " cnt[$1]
}
1
$ awk -f tst.awk file file
(0) some text let it be [STRING] --- 2
(0) some text1 let it be
(1) some text2 let it be
(1) some text3 let it be [STRING] --- 10
(1) some text7 let it be [XYZ]
(0) some text9 let it be
(1) some text10 let it be
(2) some text13 let it be [STRING] --- 3
(3) some text16 let it be [ABC]
(3) some text17 let it be [STRING] --- 2
(1) some text23 let it be [DEF]
But obviously that doesn't match your expected output, because your expected output doesn't match your stated requirements.
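To make the difference concrete, here is a small Python sketch of the whole-file counting that the stated requirement implies, as opposed to counting only consecutive lines. The name `merge_globally` and the `marker` parameter are made up for illustration, and the key is assumed to be the first whitespace-separated field.

```python
from collections import Counter


def merge_globally(lines, marker="[STRING]"):
    """Keep the first marker line per leading field and append the file-wide count."""
    # First pass: count marker lines per key across the whole input.
    counts = Counter(line.split()[0] for line in lines if line.endswith(marker))
    seen = set()
    out = []
    # Second pass: drop repeats of a key, annotate its first occurrence.
    for line in lines:
        if line.endswith(marker):
            key = line.split()[0]
            if key in seen:
                continue
            seen.add(key)
            line = f"{line} --- {counts[key]}"
        out.append(line)
    return out
```

With this reading, a key that shows up again later in the file is folded into its first occurrence, which is why the count for (1) above is 10 rather than 4, 2 and 4 for three separate runs.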
Another way is to use the Unix uniq command. On macOS (BSD), it is:
uniq -c -s20
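This counts runs of identical adjacent lines, skipping the first 20 characters of each line when comparing, and puts the count at the front. You can move the count to the end:
uniq -c -s20 | sed -E 's/^ *([0-9]+) (.*)/\2 --- \1/g'
On Ubuntu (GNU sed), older versions need sed -r instead of sed -E.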
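If it helps to see what the pipeline does, here is a rough Python equivalent. `uniq_skip` and `skip` are names chosen for this sketch. Like `uniq -c`, it appends a count to every line, including single ones, and because the first 20 characters are skipped, the leading number is not part of the comparison.

```python
from itertools import groupby


def uniq_skip(lines, skip=20):
    """Collapse adjacent lines that are equal after the first `skip` characters."""
    out = []
    for _, run in groupby(lines, key=lambda line: line[skip:]):
        run = list(run)
        # Keep the first line of the run and put the count at the end.
        out.append(f"{run[0]} --- {len(run)}")
    return out
```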