Subset a file by row and column numbers

Question

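We want to subset a text file by rows and columns, where the row and column numbers are read from files. The header (row 1) and the row names (column 1) are excluded.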

inputFile.txt: tab-separated text file

header  62  9   3   54  6   1
25  1   2   3   4   5   6
96  1   1   1   1   0   1
72  3   3   3   3   3   3
18  0   1   0   1   1   0
82  1   0   0   0   0   1
77  1   0   1   0   1   1
15  7   7   7   7   7   7
82  0   0   1   1   1   0
37  0   1   0   0   1   0
18  0   1   0   0   1   0
53  0   0   1   0   0   0
57  1   1   1   1   1   1

subsetCols.txt: comma-separated, no spaces, one line, numbers sorted. In the real data we have 500K columns and need to subset ~10K of them.

1,4,6

subsetRows.txt: comma-separated, no spaces, one line, numbers sorted. In the real data we have 20K rows and need to subset about 300 of them.

1,3,7

Current solution, using cut and an awk loop (Related post: Select rows using awk):

# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt

# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput

Output file: result.txt

1   4   6
3   3   3
7   7   7

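Problem:
This solution works fine for small files, but for bigger files with 50K rows and 200K columns it takes too long: more than 15 minutes, and it is still running. I think cutting the columns works fine; selecting the rows is the slow part.

Is there a better way?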

Real input file info:

# $fileInput:
#        Rows = 20127
#        Cols = 533633
#        Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers

More info about the file: it contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns) to make the data more manageable (smaller) as input for other statistical software such as R.

System:

$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux

Update: The solution provided below mixed up the order of the columns on my system, because I am using a different version of awk. My version is: GNU Awk 3.1.7

Answer 1

Even though in If programming languages were countries, which country would each language represent? they say that...

Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.

...whenever you see yourself piping sed, cut, grep, awk, etc., stop and say to yourself: awk can do it alone!

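So in this case it is a matter of extracting the rows and columns (adjusting them to exclude the header and the first column) and then buffering the output to print it at the end of each selected row.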

awk -v cols="1 4 6" -v rows="1 3 7" '
    BEGIN{
       split(cols,c); for (i in c) col[c[i]]  # extract cols to print
       split(rows,r); for (i in r) row[r[i]]  # extract rows to print
    }
    (NR-1 in row){
       for (i=2;i<=NF;i++)
          if ((i-1) in col)                          # pick columns
             line=(line=="" ? $i : line OFS $i)      # test against "" so a leading 0 is kept
       print line; line=""                           # print them
    }' file

With your sample file:

$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) if ((i-1) in col) line=(line=="" ? $i : line OFS $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7

With your sample file, passing the inputs as variables and splitting on commas:

awk -v cols="$(<"$fileCols")" -v rows="$(<"$fileRows")" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) if ((i-1) in col) line=(line=="" ? $i : line OFS $i); print line; line=""}' "$fileInput"

I am quite sure this will be faster. For example, you can check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk against grep and other tools.

Answer 2

One for GNU awk version 4.0 or later, as the column ordering relies on for and PROCINFO["sorted_in"]. The row and column numbers are read from files:

$ awk '
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
}
FILENAME==ARGV[1] {                       # process rows file
    n=split($0,t,","); 
    for(i=1;i<=n;i++) r[t[i]]
} 
FILENAME==ARGV[2] {                       # process cols file
    m=split($0,t,","); 
    for(i=1;i<=m;i++) c[t[i]]
} 
FILENAME==ARGV[3] && ((FNR-1) in r) {     # process data file
    for(i in c) 
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt   
1 4 6
3 3 3
7 7 7

Some performance gain could come from moving the ARGV[3] processing block to the top, before blocks 1 and 2, and adding a next at the end of it.

Answer 3

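Not to take anything away from the two excellent answers. Because this problem involves a large amount of data, I am posting a combination of the 2 answers to speed up the processing.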

awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
   n = split(cols, c, /,/)
   split(rows, r, /,/)
   for (i in r)
      row[r[i]]
}
(NR-1) in row {
   for (i=1; i<=n; i++)
      printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt

PS: This should also work with older awk versions or non-GNU awk.

Answer 4

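Refining @anubhava's solution: we can get rid of searching over 10k values for each row to see whether we are on the right row, by taking advantage of the fact that the input is already sorted.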

awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
   n = split(cols, c, /,/)
   split(rows, r, /,/)
   j=1;
}
(NR-1) == r[j] { 
   j++
   for (i=1; i<=n; i++)
      printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
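
For comparison, here is a rough Python sketch of the same pointer idea: walk the sorted row list with an index instead of looking every line number up. The function name pick_sorted_rows is made up for illustration, and it assumes the wanted row numbers are strictly ascending, as the question states. Unlike the awk script above, it also stops reading once the last wanted row has been emitted, which is an addition of mine.

```python
from typing import Iterable, Iterator, List


def pick_sorted_rows(lines: Iterable[str], wanted: List[int]) -> Iterator[str]:
    """Yield the data rows whose 1-based numbers are listed in `wanted`.

    Assumes `wanted` is strictly ascending. The first item of `lines`
    is treated as the header and is not counted as a data row.
    """
    if not wanted:
        return
    j = 0  # position of the next wanted row number
    for line_no, line in enumerate(lines):  # line_no 0 is the header
        if line_no == wanted[j]:
            yield line
            j += 1
            if j == len(wanted):
                return  # nothing left to pick, so stop reading the input
```

Because only one comparison is made per line, the cost per row no longer depends on how many row numbers were requested.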

Answer 5

Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.

This should slice columns 20,000 to 30,000.

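import csv

# The question's input is tab-separated, so split on tabs rather than commas.
with open('foo.txt', newline='') as f:
    gwas = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in gwas:
        # print the selected slice of columns for this row
        print(row[20001:30001])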
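
To tie this back to the question, here is a minimal sketch of how the csv approach might be extended to read the row and column numbers from the two subset files. The helper names read_indices and subset_file are hypothetical, and the sketch assumes both index files hold one non-empty line of comma-separated 1-based numbers, as described in the question. I have not timed it against the awk answers on the 31 GB file.

```python
import csv


def read_indices(path):
    """Read one line of comma-separated 1-based numbers, e.g. '1,4,6'."""
    with open(path) as f:
        return [int(x) for x in f.readline().strip().split(',')]


def subset_file(input_path, rows_path, cols_path, output_path):
    """Write the selected rows and columns of a tab-separated file."""
    rows = set(read_indices(rows_path))   # set gives cheap membership tests
    cols = read_indices(cols_path)        # list keeps the requested column order
    last_row = max(rows)                  # assumes at least one row is requested
    with open(input_path, newline='') as src, open(output_path, 'w') as dst:
        reader = csv.reader(src, delimiter='\t', quoting=csv.QUOTE_NONE)
        next(reader, None)                # skip the header line
        for row_no, record in enumerate(reader, start=1):
            if row_no in rows:
                # record[0] is the row name, so data column c sits at index c
                dst.write('\t'.join(record[c] for c in cols) + '\n')
            if row_no >= last_row:
                break                     # no wanted rows remain
```

Calling subset_file('inputFile.txt', 'subsetRows.txt', 'subsetCols.txt', 'result.txt') on the sample files should write the three lines shown in the question's result.txt, tab-separated.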


bash

awk

cut

bioinformatics

subset