Compare two numerical ranges in two distinct files with awk and print ALL lines from file1 and the matching ones from file2

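This new question is a follow-up to a recent question. The proposed solution worked perfectly but was not practical for downstream analysis (my question was misunderstood; the solution itself was valid).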

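I have a file, file1, with 3 columns. Columns 2 and 3 define a numerical range. The data are sorted from smallest to largest on column 2. The ranges never overlap.

file1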

S   24     96
S   126    352
S   385    465
S   548    600
S   621    707
S   724    736

I have a second file, file2 (the test file), with a similar structure.

file2

S   27     93
S   123    348
S   542    584
S   726    740
S   1014   2540
S   12652  12987

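Desired output: print all lines from file1 and, next to each, the line from file2 whose numerical range overlaps (even partially) the file1 range. If no range in file2 overlaps a file1 range, print a zero next to the file1 range.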

S   24    96     S   27    93       * 27-93 overlaps with 24-96
S   126   352    S   123   348      * 123-348 overlaps with 126-352
S   385   465    0                  * nothing in file2 overlaps with this range
S   548   600    S   542   584      * 542-584 overlaps with 548-600
S   621   707    0                  * nothing in file2 overlaps with this range
S   724   736    S   726   740      * 726-740 overlaps with 724-736

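Based on @EdMorton's answer to the previous question, I modified the print commands of the tst.awk script to add these new features. I also changed the file order from file1/file2 to file2/file1 so that all lines of file1 are printed, whether or not the second file has a match.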

NR == FNR {
begs2ends[$2] = $3
next
}
{
for (beg in begs2ends) {
    end = begs2ends[beg] + 0
    beg += 0
    if (    ( ($2 >= beg) && ($2 <= end) ) ||
            ( ($3 >= beg) && ($3 <= end) ) ||
            ( ($2 <= beg) && ($3 >= end) )  ) {
        print $0,"\t",$1,"\t",beg,"\t",end
    else
        print $0,"\t","0"
        next
    }
}
}

Note: $1 is the same in file1 and file2, which is why I use print ... $1 to make it appear. I don't know how to print it from file2 rather than file1 (if I understand correctly, this $1 refers to file1).

I then launch the analysis with awk -f tst.awk file2 file1

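The script does not accept the else statement, and I don't understand why. I assume it has something to do with the loop, but the changes I tried did not work. Thanks in advance for any help.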

Assumptions:

  • a range from file1 can overlap with only one range from file2

The current code is almost correct; it just needs some work on the placement of the braces (consistent indentation helps):

awk '
BEGIN     { OFS="\t" }                                 # output field delimiter is "\t"
NR == FNR { begs2ends[$2] = $3; next }
          {
            # $1=$1                                    # uncomment to have current line ($0) reformatted with "\t" delimiters during print
            for (beg in begs2ends) {
                end = begs2ends[beg] + 0
                beg += 0
               if ( ( ($2 >= beg) && ($2 <= end) ) ||
                    ( ($3 >= beg) && ($3 <= end) ) ||
                    ( ($2 <= beg) && ($3 >= end) ) ) {
                  print $0,$1,beg,end                  # spacing within $0 unchanged, 3 new fields prefaced with "\t"
                  next
               }
            }

            # if we get this far it is because we have exhausted the "for" loop
            # (ie, found no overlaps) so print current line + "0"

            print $0,"0"                               # spacing within $0 unchanged, 1 new field prefaced with "\t"
          }
' file2 file1

This generates:

S   24     96   S       27      93
S   126    352  S       123     348
S   385    465  0
S   548    600  S       542     584
S   621    707  0
S   724    736  S       726     740

After uncommenting the $1=$1 line, the output becomes:

S       24      96      S       27      93
S       126     352     S       123     348
S       385     465     0
S       548     600     S       542     584
S       621     707     0
S       724     736     S       726     740
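
The effect of the $1=$1 line can be seen in isolation. Assigning to any field makes awk rebuild $0 by joining the fields with OFS, which is why the original spacing disappears. A minimal sketch, using one line from file1 as input; the function name rebuild_demo is only for illustration:

```shell
# Without the assignment, $0 is printed exactly as it was read.
# With it, awk rebuilds $0 from its fields joined by OFS (a tab here).
rebuild_demo() {
    printf 'S   24     96\n' | awk 'BEGIN { OFS = "\t" } { print $0, "0" }'
    printf 'S   24     96\n' | awk 'BEGIN { OFS = "\t" } { $1 = $1; print $0, "0" }'
}

rebuild_demo
```

The first line of output keeps the runs of spaces from the input; the second has a single tab between every field.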

A slight variation on @markp-fuso's answer

For GNU awk: save as overlaps.awk

BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {
    line[FNR] = $0
    lo[FNR] = $2
    hi[FNR] = $3
    next
}
{
    overlap = "0"
    for (i in line) {
        if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3)) {
            overlap = line[i]
            delete line[i]
            break
        }
    }
    print $0, overlap
}

Then

gawk -f overlaps.awk file2 file1 | column -t

produces

S  24   96   S  27   93
S  126  352  S  123  348
S  385  465  0
S  548  600  S  542  584
S  621  707  0
S  724  736  S  726  740
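
One case the two endpoint checks above do not appear to cover is a file2 range that completely encloses a file1 range, because neither of its endpoints then falls inside the file1 range. The sample data has no such pair, so the output shown is unaffected. A minimal sketch of a test that also handles that case, assuming closed ranges: two ranges overlap exactly when each one starts no later than the other ends. The names check_overlap and overlaps, and the four-column input layout, are made up for illustration:

```shell
# Each input line holds two closed ranges: lo1 hi1 lo2 hi2 (hypothetical layout).
# Prints 1 if the two ranges overlap, 0 otherwise.
check_overlap() {
    awk '
    function overlaps(lo1, hi1, lo2, hi2) { return lo1 <= hi2 && lo2 <= hi1 }
    { print overlaps($1 + 0, $2 + 0, $3 + 0, $4 + 0) }
    '
}

# partial overlap, no overlap, second range encloses the first
printf '%s\n' '126 352 123 348' '385 465 542 584' '548 600 500 700' | check_overlap
```

In overlaps.awk the same idea would amount to replacing the two in_range calls with a single comparison of lo[i], hi[i], $2 and $3, though that is a suggestion rather than part of the answer above.
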
$ cat tst.awk
BEGIN { OFS="\t" }
NR == FNR {
    ranges[++numRanges] = $0
    next
}
{
    overlapped = 0
    for ( i=1; i<=numRanges; i++ ) {
        range = ranges[i]
        split(range,vals)
        beg = vals[2]+0
        end = vals[3]+0
        if (    ( ($2 >= beg) && ($2 <= end) ) ||
                ( ($3 >= beg) && ($3 <= end) ) ||
                ( ($2 <= beg) && ($3 >= end) )  ) {
            overlapped = 1
            break
        }
    }

    if ( overlapped ) {
        print $0, range, sprintf("* %d-%d overlaps with %d-%d", beg, end, $2, $3)
    }
    else {
        print $0, 0, sprintf("* nothing in %s overlaps with this range", ARGV[1])
    }
}

$ awk -f tst.awk file2 file1 | column -s$'\t' -t
S   24     96   S   27     93   * 27-93 overlaps with 24-96
S   126    352  S   123    348  * 123-348 overlaps with 126-352
S   385    465  0               * nothing in file2 overlaps with this range
S   548    600  S   542    584  * 542-584 overlaps with 548-600
S   621    707  0               * nothing in file2 overlaps with this range
S   724    736  S   726    740  * 726-740 overlaps with 724-736