Compare two numerical ranges in two distinct files with awk and print all lines from file1 and the matching ones from file2

Question

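This new question is a follow-up to a recent question. The solution proposed there worked perfectly but turned out to be impractical for downstream analysis (the fault was in how I framed my question, not in the solution, which works).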

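I have a file, file1, with 3 columns. Columns 2 and 3 define a numerical range. The data are sorted from smallest to largest on column 2. The ranges never overlap.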

file1

S   24     96
S   126    352
S   385    465
S   548    600
S   621    707
S   724    736

I have a second file, file2 (test), with a similar structure.

file2

S   27     93
S   123    348
S   542    584
S   726    740
S   1014   2540
S   12652  12987

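Desired output: print every line from file1 and, next to it, the line from file2 whose range overlaps (even partially) the file1 range. If no range in file2 overlaps the file1 range, print a zero next to the file1 range.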

S   24    96     S   27    93       * 27-93 overlaps with 24-96
S   126   352    S   123   348      * 123-348 overlaps with 126-352
S   385   465    0                  * nothing in file2 overlaps with this range
S   548   600    S   542   584      * 542-584 overlaps with 548-600
S   621   707    0                  * nothing in file2 overlaps with this range
S   724   736    S   726   740      * 726-740 overlaps with 724-736

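Based on @EdMorton's answer to the previous question, I modified the print commands of the tst.awk script to add these new features. I also changed the argument order from file1/file2 to file2/file1 so that all lines from file1 are printed (whether or not the second file has a match).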

'NR == FNR {
begs2ends[$2] = $3
next
}
{
for (beg in begs2ends) {
    end = begs2ends[beg] + 0
    beg += 0
    if (    ( ($2 >= beg) && ($2 <= end) ) ||
            ( ($3 >= beg) && ($3 <= end) ) ||
            ( ($2 <= beg) && ($3 >= end) )  ) {
        print $0,"\t",$1,"\t",beg,"\t",end
    else
        print $0,"\t","0"
        next
    }
}
}

Note: $1 is the same in file1 and file2. That is why I use print ... $1 to make it appear. I don't know how to print it from file2 rather than file1 (if I understand correctly, this $1 refers to file1).

Then I launch the analysis with awk -f tst.awk file2 file1

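The script does not accept the else and I don't understand why. I assume it is related to the loop, but the changes I tried did not work. Thanks in advance for any help.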

Answer 1

Assumptions:

a range from file1 can overlap with at most one range from file2

The current code is almost correct; it only needs some work on the placement of the braces (consistent indentation helps):

awk '
BEGIN     { OFS="\t" }                                 # output field delimiter is "\t"
NR == FNR { begs2ends[$2] = $3; next }
          {
            # $1=$1                                    # uncomment to have current line ($0) reformatted with "\t" delimiters during print
            for (beg in begs2ends) {
                end = begs2ends[beg] + 0
                beg += 0
                if ( ( ($2 >= beg) && ($2 <= end) ) ||
                     ( ($3 >= beg) && ($3 <= end) ) ||
                     ( ($2 <= beg) && ($3 >= end) ) ) {
                   print $0,$1,beg,end                 # spacing within $0 unchanged, 3 new fields prefaced with "\t"
                   next
                }
            }

            # if we get this far it is because we have exhausted the "for" loop
            # (ie, found no overlaps) so print current line + "0"

            print $0,"0"                               # spacing within $0 unchanged, 1 new field prefaced with "\t"
          }
' file2 file1

This generates:

S   24     96   S       27      93
S   126    352  S       123     348
S   385    465  0
S   548    600  S       542     584
S   621    707  0
S   724    736  S       726     740

With the commented-out $1=$1 line enabled, the output becomes:

S       24      96      S       27      93
S       126     352     S       123     348
S       385     465     0
S       548     600     S       542     584
S       621     707     0
S       724     736     S       726     740
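
On the else itself: in the question's script it appears before the if block's closing brace, which awk rejects as a syntax error. Moving it after the brace would parse, but an else inside the loop runs once for every file2 range that does not overlap, rather than once when no range overlaps. A small sketch of the difference, assuming a cut-down input (one file1 line and the first two file2 lines) and a hypothetical helper named overlaps(); the scratch directory and variable names are illustrative only:

```shell
tmp=$(mktemp -d)
printf 'S 126 352\n' > "$tmp/file1"
printf 'S 27 93\nS 123 348\n' > "$tmp/file2"

# Variant 1: "else" inside the loop, one "0" row per non-overlapping file2 range.
inside=$(awk '
# hypothetical helper: closed ranges [a,b] and [c,d] overlap when a <= d and c <= b
function overlaps(a, b, c, d) { return a <= d && c <= b }
NR == FNR { n++; lo[n] = $2 + 0; hi[n] = $3 + 0; next }
{
    for (i = 1; i <= n; i++) {
        if (overlaps($2 + 0, $3 + 0, lo[i], hi[i])) {
            print $0, lo[i], hi[i]
        } else {
            print $0, "0"
        }
    }
}' "$tmp/file2" "$tmp/file1")

# Variant 2: "0" printed after the loop, exactly one row per file1 line.
after=$(awk '
function overlaps(a, b, c, d) { return a <= d && c <= b }
NR == FNR { n++; lo[n] = $2 + 0; hi[n] = $3 + 0; next }
{
    for (i = 1; i <= n; i++) {
        if (overlaps($2 + 0, $3 + 0, lo[i], hi[i])) {
            print $0, lo[i], hi[i]
            next
        }
    }
    print $0, "0"
}' "$tmp/file2" "$tmp/file1")

printf 'inside:\n%s\nafter:\n%s\n' "$inside" "$after"
rm -rf "$tmp"
```

Variant 1 emits a spurious "0" row for the 27-93 range before reaching the real match, which is why the "0" print belongs after the loop.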

Answer 2

A slightly different take on @markp-fuso's answer.

For GNU awk: save as overlaps.awk

BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {
    line[FNR] = $0
    lo[FNR] = $2
    hi[FNR] = $3
    next
}
{
    overlap = "0"
    for (i in line) {
        if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3)) {
            overlap = line[i]
            delete line[i]
            break
        }
    }
    print $0, overlap
}

Then

gawk -f overlaps.awk file2 file1 | column -t

outputs

S  24   96   S  27   93
S  126  352  S  123  348
S  385  465  0
S  548  600  S  542  584
S  621  707  0
S  724  736  S  726  740
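
One case the two endpoint checks do not cover is a file2 range that fully contains the file1 range, because neither file2 endpoint then lies inside the file1 range. The sample data has no such pair, so the output shown here is unaffected. A hedged sketch with made-up ranges (30-40 and 20-50, not from the question) that shows the gap and one possible way to close it, by also testing a file1 endpoint against the file2 range:

```shell
tmp=$(mktemp -d)
printf 'S 30 40\n' > "$tmp/file1"   # hypothetical range, not from the question
printf 'S 20 50\n' > "$tmp/file2"   # hypothetical range that fully contains it

# Endpoint checks only: the containing range is missed.
two=$(awk '
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR { lo = $2 + 0; hi = $3 + 0; next }
{
    hit = in_range(lo, $2 + 0, $3 + 0) || in_range(hi, $2 + 0, $3 + 0)
    print (hit ? "overlap" : "0")
}' "$tmp/file2" "$tmp/file1")

# Extra check (an assumption, not part of the answer): file1 start inside the file2 range.
three=$(awk '
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR { lo = $2 + 0; hi = $3 + 0; next }
{
    hit = in_range(lo, $2 + 0, $3 + 0) || in_range(hi, $2 + 0, $3 + 0) ||
          in_range($2 + 0, lo, hi)
    print (hit ? "overlap" : "0")
}' "$tmp/file2" "$tmp/file1")

printf 'two checks: %s\nthree checks: %s\n' "$two" "$three"
rm -rf "$tmp"
```

If file2 ranges can never contain a file1 range in the real data, the original two checks are sufficient.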

Answer 3

$ cat tst.awk
BEGIN { OFS="\t" }
NR == FNR {
    ranges[++numRanges] = $0
    next
}
{
    overlapped = 0
    for ( i=1; i<=numRanges; i++ ) {
        range = ranges[i]
        split(range,vals)
        beg = vals[2]+0
        end = vals[3]+0
        if (    ( ($2 >= beg) && ($2 <= end) ) ||
                ( ($3 >= beg) && ($3 <= end) ) ||
                ( ($2 <= beg) && ($3 >= end) )  ) {
            overlapped = 1
            break
        }
    }

    if ( overlapped ) {
        print $0, range, sprintf("* %d-%d overlaps with %d-%d", beg, end, $2, $3)
    }
    else {
        print $0, 0, sprintf("* nothing in %s overlaps with this range", ARGV[1])
    }
}

$ awk -f tst.awk file2 file1 | column -s$'\t' -t
S   24     96   S   27     93   * 27-93 overlaps with 24-96
S   126    352  S   123    348  * 123-348 overlaps with 126-352
S   385    465  0               * nothing in file2 overlaps with this range
S   548    600  S   542    584  * 542-584 overlaps with 548-600
S   621    707  0               * nothing in file2 overlaps with this range
S   724    736  S   726    740  * 726-740 overlaps with 724-736
