合并两个 WIG 文件

Question

我尝试合并多个 variableStep 格式的 WIG 文件。这些文件用于在基因组上下文中存储碱基对 - 分数对。

例如文件 A 和 B 可能如下所示：

A

variableStep chrom=chr1
9967151 1.0
9967152 0.5
variableStep chrom=chr2
1107150 0.2
1107151 0.3

B

variableStep chrom=chr1
10001000 0.3
10001001 0.4
variableStep chrom=chr2
20002000 0.9
20002001 0.7

我要找的最终结果是这样的：

AxB

variableStep chrom=chr1
9967151 1.0
9967152 0.5
10001000 0.3
10001001 0.4
variableStep chrom=chr2
1107150 0.2
1107151 0.3
20002000 0.9
20002001 0.7

你知道我如何有效地合并两个这样的文件的有效版本吗？

它们可能会变得相当大，我想避免循环遍历它并过多地跟踪 header 行。例如，我正在寻找 bash 命令或 R 中存在的东西。

我目前拥有的大型最终文件的一小部分大小约为 800 MB，但随着时间的推移，这将急剧增长到数十到数百 GB 的范围。

最好按块对它进行数字排序，但能够使用它并不重要。

该文件将在稍后的步骤中转换为所谓的 BigWIG 格式的较小二进制文件。

背景：

我想建立一个大文件来存储人类基因组中所有碱基对的信息，所有染色体都在一个文件中。由于我存储的分数会随着时间的推移在较小的单个文件中计算，所以每次我得到一个新的片段（如 A 或 B）时，我都想更新最终的大文件。理想情况下，这些甚至应该是唯一的，因此应该仅集成行当它们还不可用时，但现在简单的合并就足够了

到目前为止我尝试了什么：

我知道如何使用 cat 来连接文件，但这里的问题是我的 header 行比它应该有的要多

另外，我有一个解决方案，我循环遍历每一行，跟踪我当前拥有的染色体并将这些行附加到正确的位置。但是，对于大型数据集来说，这需要相当长的时间

在 R 中使用 rtracklayer::import，我可以读取 WIG 文件，附加它们，使它们唯一并再次导出它们。这行得通，但是每次我想更新时将整个“数据库”文件读入内存非常耗费内存

谢谢！

Answer 1

假设：

无需（重新）排序数据
'small' 文件可以放入内存
保留重复分数

设置：

$ cat smallfile
variableStep chrom=chr1          # merge
10001000 0.3
10001001 0.4
variableStep chrom=chr2          # merge
1107150 0.2                      # duplicate
20002000 0.9
20002001 0.7
variableStep chrom=chr78         # new
3341000 0.3
3526001 0.4

$ cat bigfile
variableStep chrom=chr1
9967151 1.0
9967152 0.5
variableStep chrom=chr117
2347150 0.2
2347151 0.3
variableStep chrom=chr2
1107150 0.2
1107151 0.3
variableStep chrom=chr36
4727150 0.2
4727151 0.3

一个GNU awk（对于multi-dimensional数组）想法：

awk '
### 1st file / smallfile
FNR==NR { if ( == "variableStep") {      # if new header line then reset variables ...
             chrom=
             n=0
             next
          }
          scores[chrom][++n]=[=11=]            # store non-header lines in scores[] array
          next
        }

### 2nd file / bigfile
        { print                            # print current line
          if ( == "variableStep")        # if current line is a header line and ...
             if ( in scores) {           # the "chrom=chrX" is an index in the scores[] array then ...
                for (n in scores[])      # loop through 2nd index and ...
                    print scores[][n]    # print the array entry to stdout (ie, add to our output stream)
                delete scores[]          # delete data to keep from being added again in the "END" block
             }
        }

### add "new" entries (from 1st file / smallfile)
END     { for (chrom in scores) {
              print "variableStep", chrom
              for (n in scores[chrom])
                  print scores[chrom][n]
          }
        }
' smallfile bigfile > newbigfile

这会生成：

$ cat newbigfile
variableStep chrom=chr1      # merged
10001000 0.3                 # added
10001001 0.4                 # added
9967151 1.0
9967152 0.5
variableStep chrom=chr117
2347150 0.2
2347151 0.3
variableStep chrom=chr2      # merged
1107150 0.2                  # added; duplicate
20002000 0.9                 # added
20002001 0.7                 # added
1107150 0.2                  #        duplicate
1107151 0.3
variableStep chrom=chr36
4727150 0.2
4727151 0.3
variableStep chrom=chr78     # added; new
3341000 0.3                  # added
3526001 0.4                  # added

备注：

可以添加排序，但会增加内存使用量（如果只是在给定的 header 行中排序分数，则可能不是问题）
删除重复的分数是可行的，内存使用量的增加可以忽略不计
将“新”条目添加到 bigfile 中 'header line' 排序顺序需要更多编码才能实现

合并两个 WIG 文件

Merging two WIG files

bash

bioinformatics