Help writing an awk version of a Python script that generates a count matrix
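I haven't found any similar questions...
I have this Python script that generates a count matrix from files containing only sequences, but it takes forever to run, and I know awk would do it faster. I'm not very good at awk, but I hope someone can help.
The Python script is as follows: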
import sys

# Usage: script.py <numFiles> <key_file> <out_file> <file1> ... <fileN>
numFiles = int(sys.argv[1])
allParams = int(numFiles + 4)
key_file = sys.argv[2]
out_file = sys.argv[3]
#open the output file
outHandle = open(out_file,'w')
#Open key file and read one line at a time
with open(key_file) as kf:
    for eachline in kf:
        temp_list = [0] * numFiles
        kSeq = eachline.strip(' \t\n\r')
        upRange = int(numFiles + 4)
        # every data file is re-read from disk for every key line
        for i in range(4,upRange):
            with open(sys.argv[i]) as f:
                for eachline in f:
                    seq = eachline.strip(' \t\n\r')
                    if (kSeq == seq):
                        curr = int(temp_list[i-4])
                        nw = int(curr + 1)
                        temp_list[i-4] = nw
                    else:
                        continue
        # write the key followed by its tab-separated counts
        outHandle.write(str(kSeq) + "\t")
        for ind,item in enumerate(temp_list):
            lastItemIndex = numFiles - 1
            if(ind == lastItemIndex):
                outHandle.write(str(item) + "\n")
            else:
                outHandle.write(str(item) + "\t")
outHandle.close()
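Here is an attempt at an example:
Input: one keyFile and X other files (all input files are basically just words in a single column).
Output: a matrix containing the number of times each word in keyFile occurs in each of the X files.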
Key file:
word
one
two
three
four
five
File 1:
word
three
five
three
one
two
one
four
four
three
File 2:
word
four
one
three
three
one
two
three
two
one
Output:
word | file1 | file2 |
---|---|---|
one | 2 | 3 |
two | 1 | 2 |
three | 3 | 3 |
four | 2 | 1 |
five | 1 | 0 |
The maximum number of files is 4.
I hope this illustration makes it clearer.
Thanks
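The expected matrix in the example can be checked with a minimal in-memory sketch. It assumes the `word` header line has already been dropped and that every file fits in memory; the names `count_matrix`, `keys`, `file1`, and `file2` are illustrative and do not come from the original script.

```python
from collections import Counter

def count_matrix(keys, files):
    """Return one row per key: [key, count_in_file_1, count_in_file_2, ...]."""
    # Tally each file once; a Counter returns 0 for words it never saw.
    tallies = [Counter(words) for words in files]
    return [[key] + [tally[key] for tally in tallies] for key in keys]

# Sample data from the example, with the header line "word" omitted (assumption).
keys = ["one", "two", "three", "four", "five"]
file1 = ["three", "five", "three", "one", "two", "one", "four", "four", "three"]
file2 = ["four", "one", "three", "three", "one", "two", "three", "two", "one"]

# Print the matrix as tab-separated rows, one row per key.
for row in count_matrix(keys, [file1, file2]):
    print("\t".join(str(cell) for cell in row))
```

Because each file is tallied once, the work grows with the total number of lines rather than with the number of keys times the number of lines.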
So, after a lot of reading and experimenting, I got what I wanted with the following code:
awk 'fname != FILENAME { fname = FILENAME; idx++ } idx == 1 {key[$0] = $0 } idx == 2 {if($1 == key[$1]){ f1[$1] += 1 }} idx == 3 {if($1 == key[$1]){ f2[$1] += 1 }} END {for(seq in key) print seq "\t" f1[seq] "\t" f2[seq] }' keyFile file1 file2
Thanks, everyone, for participating.
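For comparison, the same single-pass idea can be sketched in Python for any number of files. This is only a sketch under stated assumptions: `write_count_matrix` and its parameters are hypothetical names, the header line `word` is treated as an ordinary word, and duplicate lines in the key file collapse into one row. Unlike the awk one-liner, it prints `0` rather than an empty field for words that never occur, and it keeps the key-file order (awk's `for (seq in key)` visits keys in an unspecified order).

```python
def write_count_matrix(key_path, data_paths, out_path):
    """Count how often each key-file line occurs in each data file."""
    # One list of counters per key; dicts preserve the key-file order.
    with open(key_path) as kf:
        counts = {line.strip(): [0] * len(data_paths)
                  for line in kf if line.strip()}

    # Stream every data file exactly once instead of once per key.
    for col, path in enumerate(data_paths):
        with open(path) as f:
            for line in f:
                row = counts.get(line.strip())
                if row is not None:  # ignore words missing from the key file
                    row[col] += 1

    # Write tab-separated rows: key, then one count per data file.
    with open(out_path, "w") as out:
        for key, row in counts.items():
            out.write("\t".join([key] + [str(n) for n in row]) + "\n")
```

A call such as `write_count_matrix("keyFile", ["file1", "file2"], "matrix.tsv")` would correspond to the awk invocation above; the output file name `matrix.tsv` is made up for illustration.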