Help writing an awk version of Python code that generates a count matrix

Question

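I haven't found any similar question so far...

I have this Python script that generates a count matrix from files containing only sequences, but it takes forever to run, and I believe awk would do it much faster. I'm not very good at awk, so I'm hoping someone can help. The Python script is as follows: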

    import sys

    # Usage: script.py NUM_FILES KEY_FILE OUT_FILE FILE1 [FILE2 ...]
    numFiles = int(sys.argv[1])
    key_file = sys.argv[2]
    out_file = sys.argv[3]
    upRange = numFiles + 4  # the input files are sys.argv[4:upRange]

    # Open the output file, then read the key file one line at a time
    with open(out_file, 'w') as outHandle, open(key_file) as kf:
        for keyline in kf:
            temp_list = [0] * numFiles  # one count per input file
            kSeq = keyline.strip(' \t\n\r')

            # Every input file is read again from the start for each key
            for i in range(4, upRange):
                with open(sys.argv[i]) as f:
                    for eachline in f:
                        seq = eachline.strip(' \t\n\r')
                        if kSeq == seq:
                            temp_list[i - 4] += 1

            # Write the key followed by its tab-separated counts
            outHandle.write(kSeq + "\t" + "\t".join(str(item) for item in temp_list) + "\n")

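Here is my attempt at an example:

Input: one key file and X other files (every input file is essentially a single column of words). Output: a matrix giving the number of times each word from the key file occurs in each of the X files.

Key file: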

word
one
two
three
four
five

File 1:

word
three
five
three
one
two
one
four
four
three

File 2:

word
four
one
three
three
one
two
three
two
one

Output:

word	file1	file2
one	2	3
two	1	2
three	3	3
four	2	1
five	1	0

The maximum number of files is 4.

I hope this illustration makes things clearer.

Thanks

Answer 1

So, after a lot of reading and trial and error, I got what I wanted with the following code:

    awk 'fname != FILENAME { fname = FILENAME; idx++ } idx == 1 { key[$0] = $0 } idx == 2 { if ($1 == key[$1]) { f1[$1] += 1 } } idx == 3 { if ($1 == key[$1]) { f2[$1] += 1 } } END { for (seq in key) print seq "\t" f1[seq] "\t" f2[seq] }' keyFile file1 file2
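The same idea (load the keys once, then make a single pass over each input file) can also be sketched in Python for any number of files. This is only a rough sketch under some assumptions, not a drop-in replacement for the script in the question: `count_matrix` and `write_matrix` are made-up names, blank lines in the key file are skipped, and rows come out in key-file order with an explicit 0 for words a file never contains.

```python
from collections import Counter


def count_matrix(keys, files):
    """Return one row per key: [key, count_in_file_1, count_in_file_2, ...].

    keys  -- iterable of lines from the key file
    files -- list of iterables of lines, one per input file
    (Both names are hypothetical; open file handles work as iterables.)
    """
    # Strip whitespace and skip blank lines (an assumption of this sketch);
    # the order of the key file is preserved.
    key_list = [k.strip() for k in keys if k.strip()]
    wanted = set(key_list)

    # One pass per input file: tally only the lines that are keys.
    per_file = []
    for lines in files:
        stripped = (line.strip() for line in lines)
        per_file.append(Counter(s for s in stripped if s in wanted))

    # A Counter returns 0 for a missing key, so absent words show up as 0.
    return [[key] + [counts[key] for counts in per_file] for key in key_list]


def write_matrix(rows, out):
    """Write the rows to an open text file as tab-separated lines."""
    for row in rows:
        out.write("\t".join(str(cell) for cell in row) + "\n")
```

To use it with real files, open the key file and the input files, pass the handles to `count_matrix`, and hand the result to `write_matrix` together with an output file opened for writing. Each input file is then read exactly once, instead of once per key as in the original script.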

Thanks to everyone for taking part.
