Compare two numerical ranges in two distinct files with awk and print ALL lines from file1 and the matching ones from file2
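This new question is a follow-up to a recent question. The solution proposed there works perfectly, but it is not practical for my downstream analysis (the question was misunderstood; the solution itself is valid).
I have a file, file1, with 3 columns. Columns 2 and 3 define a numerical range. The data are sorted from smallest to largest on column 2. The numerical ranges never overlap.
file1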
S 24 96
S 126 352
S 385 465
S 548 600
S 621 707
S 724 736
I have a second file, file2 (a test file), with a similar structure.
file2
S 27 93
S 123 348
S 542 584
S 726 740
S 1014 2540
S 12652 12987
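Desired output: print ALL lines from file1 and, next to each one, the line from file2 whose numerical range overlaps (even partially) with the file1 range. If no range in file2 overlaps a given file1 range, print a zero next to that file1 range.
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352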
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736
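Based on @EdMorton's answer to the previous question, I modified the print commands of the tst.awk script to add these new features. I also changed the order of the file arguments from file1 file2 to file2 file1 so that ALL lines from file1 are printed (whether or not the second file has a match).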
NR == FNR {
    begs2ends[$2] = $3
    next
}
{
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            print $0,"\t",$1,"\t",beg,"\t",end
        else
            print $0,"\t","0"
            next
        }
    }
}
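Note: $1 is identical in file1 and file2, which is why I used print ... $1 to make it appear. I don't know how to print it from file2 rather than file1 (if I understand correctly, this $1 refers to file1).
I then launch the analysis with awk -f tst.awk file2 file1
The script does not accept the else statement, and I don't understand why. I assume it has something to do with the loop, but the changes I tried did not work.
Thanks in advance for any help.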
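Assumption: a range in file1 can overlap with only one range in file2.
The current code is almost correct; it just needs some work on the placement of the braces (consistent indentation helps):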
awk '
BEGIN { OFS="\t" }                       # output field delimiter is "\t"
NR == FNR { begs2ends[$2] = $3; next }   # 1st file (file2): map range start -> range end
{
    # $1=$1                              # uncomment to have current line ($0) reformatted with "\t" delimiters during print
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            print $0,$1,beg,end          # spacing within $0 unchanged, 3 new fields prefaced with "\t"
            next
        }
    }
    # if we get this far it is because we have exhausted the "for" loop
    # (ie, found no overlaps) so print current line + "0"
    print $0,"0"                         # spacing within $0 unchanged, 1 new field prefaced with "\t"
}
' file2 file1
This generates:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
With the $1=$1 line uncommented, the output becomes:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
S 900 1000 S 901 905
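As for why awk rejects the else in the question's script: the else is written before the closing brace of the if block, so awk meets it as a stray keyword inside that block. An else has to come after the complete if body. Moving the brace alone would still print a "0" inside the loop for every file2 range that does not match, which is why the corrected script prints the "0" only once the loop is exhausted. A minimal sketch of the brace placement on its own, assuming made-up inline input and a threshold of 100 chosen purely for illustration:

```shell
# Hypothetical demo data; the threshold of 100 is only for illustration.
out=$(printf 'S 24 96\nS 126 352\n' | awk '
{
    if ($2 < 100) {            # the "if" body is closed with "}" ...
        print $0, "small"
    }
    else {                     # ... before "else" opens its own block
        print $0, "large"
    }
}')
printf '%s\n' "$out"
```

Each input line takes exactly one of the two branches, so each line is printed once with either "small" or "large" appended.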
This differs slightly from @markp-fuso's answer.
For GNU awk: save as overlaps.awk
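BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }   # gawk: traverse arrays in ascending numeric index order
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {                     # 1st file (file2): remember each line and its range
    line[FNR] = $0
    lo[FNR] = $2
    hi[FNR] = $3
    next
}
{                               # 2nd file (file1)
    overlap = "0"
    for (i in line) {
        if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3)) {
            overlap = line[i]
            delete line[i]      # a file2 range is used at most once
            break
        }
    }
    print $0, overlap
}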
then
gawk -f overlaps.awk file2 file1 | column -t
outputs
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
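The multi-part range tests used in these scripts can be collapsed into a single comparison: two closed ranges overlap exactly when each one begins no later than the other ends. A minimal sketch under that observation; the function name overlaps and the inline sample values are assumptions for illustration and are not taken from any of the answers:

```shell
out=$(awk '
# true (1) when the closed ranges [b1,e1] and [b2,e2] share at least one point
function overlaps(b1, e1, b2, e2) { return b1 <= e2 && b2 <= e1 }
BEGIN {
    print overlaps(126, 352, 123, 348)   # partial overlap
    print overlaps(24, 96, 27, 93)       # second range inside the first
    print overlaps(27, 93, 24, 96)       # first range inside the second
    print overlaps(385, 465, 542, 584)   # disjoint ranges
}')
printf '%s\n' "$out"
```

A test that only checks whether the endpoints of one range fall inside the other misses the case where that range strictly contains the other; the single comparison covers both containment directions.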
$ cat tst.awk
BEGIN { OFS="\t" }
NR == FNR {                     # 1st file (file2): store every line
    ranges[++numRanges] = $0
    next
}
{                               # 2nd file (file1): look for an overlapping stored range
    overlapped = 0
    for ( i=1; i<=numRanges; i++ ) {
        range = ranges[i]
        split(range,vals)
        beg = vals[2]+0
        end = vals[3]+0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            overlapped = 1
            break
        }
    }
    if ( overlapped ) {
        print $0, range, sprintf("* %d-%d overlaps with %d-%d", beg, end, $2, $3)
    }
    else {
        print $0, 0, sprintf("* nothing in %s overlaps with this range", ARGV[1])
    }
}
$ awk -f tst.awk file2 file1 | column -s$'\t' -t
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736