Discrete to continuous number ranges via awk
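Assume a text file, file, that contains multiple discrete number ranges, one per line. Each range is preceded by a string (i.e., the range name). The lower and upper bound of each range are separated by a dash. Each number range is followed by a semicolon. The individual ranges are sorted (i.e., range 101-297 comes before 1299-1301) and do not overlap.
$ cat file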
foo 101-297;
bar 1299-1301;
baz 1314-5266;
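Note that in the example above the three ranges do not form a continuous range starting at the integer 1.
I believe awk is the appropriate tool to fill in the missing number ranges so that all ranges taken together form a continuous range from {1} to {upper bound of the last range}. If so, what awk command/function would you use to perform the task?
$ cat file | sought_awk_command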
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
new3 1302-1313;
baz 1314-5266;
--
Edit 1: On closer evaluation, the code suggested below fails on another simple example.
$ cat example2
foo 101-297;
bar 1299-1301;
baz 1302-1314; # Notice that ranges "bar" and "baz" are contiguous with one another
qux 1399-5266;
$ awk -F'[ -]' '$3-Q>1{print "new"++o,Q+1"-"$3-1";";Q=$4} 1' example2
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
baz 1302-1314;
new3 1302-1398; # ERROR HERE: Notice that range "new3" has a lower bound that is equal to upper bound of "bar", not of "baz".
qux 1399-5266;
--
Edit 2: Many thanks to RavinderSingh13 for helping with this problem. However, the suggested code still generates output that is inconsistent with the given objective.
$ cat example3
foo 35025-35144;
bar 35259-35375;
baz 35376-35624;
qux 37911-39434;
$ awk -F'[ -]' '$3-Q+0>=1{print "new"++o,Q+1"-"$3-1";";Q=$4} {Q=$4;print}' example3
new1 1-35024;
foo 35025-35144;
new2 35145-35258;
bar 35259-35375;
new3 35376-35375; # ERROR HERE: Notice that range "new3" has been added, even though ranges "bar" and "baz" are contiguous.
baz 35376-35624;
new4 35625-37910;
qux 37911-39434;
Try:
awk -F'[ -]' '$3-Q>1{print "new"++o,Q+1"-"$3-1";";Q=$4} 1' Input_file
Note that with -F'[ -]' every single space or dash separates fields, so $3 and $4 are the lower and upper bound only when two spaces separate the name from the range; with a single space the bounds are $2 and $3.
EDIT: Adding a non-one-liner form of the solution now too, with a proper explanation.
awk -F'[ -]' '                    ### Set the field separator to a space or a dash.
$3-Q>1{                           ### If the 3rd field (lower bound) minus Q (previous upper bound) is greater than 1, do the following.
  print "new"++o,Q+1"-"$3-1";";   ### Print "new" plus the incremented counter o, then Q+1, a dash, the lower bound minus 1, and a semicolon.
  Q=$4                            ### Store the 4th field (upper bound) of the current line in Q; its trailing semicolon is ignored in arithmetic.
}
1                                 ### Print the current line.
' Input_file                      ### The input file.
EDIT2: Adding one more answer as per the OP's condition.
awk -F'[ -]' '$3-Q+0>=1{print "new"++o,Q+1"-"$3-1";";Q=$4} {Q=$4;print}' Input_file
This has no problem with ranges that overlap, as in your original example2, where "bar 1299-1301;" and "baz 1301-1314;" overlap at 1301.
$ cat tst.awk
{ split($2,curr,/[-;]/); currStart=curr[1]; currEnd=curr[2] }
currStart > (prevEnd+1) { print "new"++cnt, prevEnd+1 "-" currStart-1 ";" }
{ print; prevEnd=currEnd }
$ awk -f tst.awk file
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
new3 1302-1313;
baz 1314-5266;
$ awk -f tst.awk example2
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
baz 1301-1314;
new3 1315-1398;
qux 1399-5266;
$ awk -f tst.awk example3
new1 1-35024;
foo 35025-35144;
new2 35145-35258;
bar 35259-35375;
baz 35376-35624;
new3 35625-37910;
qux 37911-39434;
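The same idea can be wrapped in a shell function. The following is only a minimal sketch, not part of the answer above: the function name fill_gaps is hypothetical, and the max-style update of prevEnd is an added assumption so that a range lying entirely inside an earlier one cannot move the boundary backwards. It assumes the "name lower-upper;" layout shown in the question, with whitespace between the name and the range.

```shell
# fill_gaps: hypothetical helper; reads "name lower-upper;" lines from
# the given files (or stdin) and inserts "newN" lines for missing ranges.
fill_gaps() {
  awk '
    {
      # split "101-297;" into its lower and upper bound
      split($2, curr, /[-;]/)
      currStart = curr[1] + 0
      currEnd   = curr[2] + 0
    }
    # a gap exists only if at least one integer lies between the ranges
    currStart > prevEnd + 1 {
      print "new" ++cnt, (prevEnd + 1) "-" (currStart - 1) ";"
    }
    {
      print
      # assumption: keep the largest upper bound seen so far
      if (currEnd > prevEnd) prevEnd = currEnd
    }
  ' "$@"
}
```

It can be called as fill_gaps example2 or at the end of a pipeline, e.g. cat example2 | fill_gaps.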
$ cat file1
foo 2-100
bar 102-200
$ awk -F' +|[-;]' 'p+1<$2{print "new" ++q, p+1 "-" $2-1 ";"}p=$3' file1
new1 1-1;
foo 2-100
new2 101-101;
bar 102-200
$ cat file2
foo 101-297;
bar 1299-1301;
baz 1314-5266;
$ awk -F' +|[-;]' 'p+1<$2{print "new" ++q, p+1 "-" $2-1 ";"}p=$3' file2
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
new3 1302-1313;
baz 1314-5266;
Explained:
$ awk -F' +|[-;]' '                  # FS is ; or - or a run of spaces
p+1 < $2 {                           # if previous+1 is still less than the new lower bound
    print "new"++q,p+1 "-" $2-1 ";"  # print a "new" line
}
p=$3                                 # set future p and implicit print of record *
' file2                              # * as all values are above 0
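The implicit print relies on the assigned upper bound being non-zero, as the footnote says. For comparison, here is a sketch of a more explicit variant that prints every record unconditionally; the function name fill_gaps_fs is hypothetical and not part of the answer, and the field layout is assumed to be the one shown in file1 and file2.

```shell
# fill_gaps_fs: hypothetical helper; same field separator as the
# one-liner, but the record is printed by an explicit action.
fill_gaps_fs() {
  awk -F' +|[-;]' '
    # $2 is the lower bound, $3 the upper bound of the current range
    p + 1 < $2 { print "new" ++q, (p + 1) "-" ($2 - 1) ";" }
    # remember the upper bound and always print the original line
    { p = $3; print }
  ' "$@"
}
```

Usage is the same as for the one-liner, e.g. fill_gaps_fs file2.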