How to build a tab-delimited text file with many calculated values?
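Background
I am computing or retrieving data for many different parameters (pc.genes, pc.transcripts, pc.genes.antisense, etc.) from a set of similar reference files (sample1.txt, sample2.txt, etc.).
A simplified example of a single ref.file (e.g., sample1.txt):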
word1 word2 word3 405438 409170 . Y . word4; word5
word1 word2 word3 405438 409170 . N . word4; word5
word1 word2 word3 409006 409170 . N . word4; word5
word1 word2 word3 405438 408401 . Y . word4; word5
word1 word2 word3 407099 408361 . N 0 word4; word5
The calculation of the "avg.exons" parameter might look like this:
$ awk '$3 == "word3"' sample1.txt | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}'
5.96732
The retrieval of the "pc.genes" parameter might look like this:
$ awk '$3 == "word3"' sample1.txt | grep -c "word4"
19062
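These are only examples, in case the solution requires piping the commands into a function that then transfers/adds them to the table. The output of each command is always a single number.
Desired output
I would like to put these calculated/retrieved values into an organized table format (preferably a tab-delimited text file) so that I can generate plots from the data: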
ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength
sample1.txt 19062 116573 2585576 1318321 5.96732 3732.57
sample2.txt 19753 138563 5834759 1433785 5.84654 4023.89
sample3.txt 19376 124576 2871235 1983263 6.78929 3890.32
Is this possible? If so, how can I do it?
Attempt
for file in sample*.txt
do
    printf "%s\n" ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength | paste -sd $'\t'
    pc.genes=$(awk '$3 == "word3"' ${file} | grep -c "word4")
    avg.exons=$(awk '$3 == "word3"' ${file} | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
    ... # get rest of desired values
done > table.txt
Errors produced
-bash: pc.genes=19062: command not found
... # other errors with corresponding CORRECT value outputs
-bash: avg.exons=5.96732: command not found
... # the errors even continue into the other sample*.txt files, which is good
-bash: pc.genes=19753: command not found
...
All of the values attached to the given parameters (i.e., the "=###" part) are correct, but the errors prevent them from being written to the table.
Based solely on the details provided by the OP, and assuming a loop construct that processes one file at a time, something like:
# print header
printf "ref.file\tpc.genes\tpc.transcripts\tpc.genes.antisense\tpc.genes.sense\tavg.exons\tavg.genelength\n"
while read -r fn
do
    aexons=$(awk '$1 == "word1"' "${fn}" | sed -n 's/.*word2 \([^;]*\).*word3 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
    pgenes=$(awk '$1 == "word1"' "${fn}" | grep -c "word2")
    ... # get rest of desired values
    # print tab-delimited output to stdout; adjust formats as needed
    printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${fn}" "${pgenes}" .... "${aexons}" ...
done < <('ls' sample*.txt) # replace with whatever logic OP is using to find desired files
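While the above should work, it is not very efficient given all of the subprocess calls ($(...); piped commands), and it requires processing each input file (${fn}) 6 times (once for each of the 6 values).
A more efficient approach would process each input file (${fn}) only once.
A further step could be to eliminate the loop in favor of a single program that processes all of the files in one pass.
Since awk can parse the data (from multiple files), compute sums/averages, and generate the (tab-delimited) output, I would probably lean toward a single awk command/invocation as the more efficient solution ... but it is hard to say for certain without sample data and more details about the required calculations.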
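As a rough sketch of the single-awk idea described above, the function below reads each file once and prints one tab-separated row per file. The function name summarize, the column names n.word3 and avg.length, and both calculations (counting rows whose third field is word3, and averaging field 5 minus field 4 plus 1 over those rows) are assumptions made purely for illustration; they are not the OP's actual calculations.

```shell
#!/usr/bin/env bash
# Sketch only: one awk process handles every file given as an argument.
# The two metrics are hypothetical stand-ins for the OP's real parameters.
summarize() {
    awk '
        # print the finished row for the file we just left (if any)
        function emit() {
            if (fname != "") print fname, n, (n ? total / n : 0)
        }
        BEGIN { OFS = "\t"; print "ref.file", "n.word3", "avg.length" }
        # first line of each new file: flush the previous row, reset counters
        FNR == 1 { emit(); fname = FILENAME; n = 0; total = 0 }
        # hypothetical metric inputs: rows whose 3rd field is "word3"
        $3 == "word3" { n++; total += $5 - $4 + 1 }
        # flush the row for the last file
        END { emit() }
    ' "$@"
}
```

It could be called as summarize sample*.txt > table.txt. Each additional parameter would become another counter that is reset in the FNR == 1 rule and printed in emit(). An empty input file produces no row, because its first line never arrives to trigger the per-file reset.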
The following answer works perfectly; it combines the suggestions from markp-fuso and KamilCuk. Thank you both!
# add the table headers
printf "%s\n" ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength | paste -sd $'\t'
for file in sample*.txt
do
    # store the result of each parameter calculation/retrieval in a variable
    pcgenes=$(awk '$3 == "word3"' "${file}" | grep -c "word4")
    pctranscripts=$(...)
    pcgenesantisense=$(...)
    pcgenessense=$(...)
    avgexons=$(awk '$3 == "word3"' "${file}" | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
    avggenelength=$(...)
    # print all resulting values in a single tab separated row of the table
    printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${file}" "${pcgenes}" "${pctranscripts}" "${pcgenesantisense}" "${pcgenessense}" "${avgexons}" "${avggenelength}"
done > table.txt
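For anyone wondering why the original attempt failed while the version above works: bash only treats name=value as an assignment when the name is a valid identifier (letters, digits and underscores, not starting with a digit). A word such as pc.genes=19062 contains a dot, so bash looks it up as a command instead, hence "command not found". A minimal sketch of the difference, using the values from the question and the hypothetical variable names pc_genes and avg_exons:

```shell
#!/usr/bin/env bash
# The dotted name is not a valid identifier, so the whole word is run as a command and fails.
if ! pc.genes=19062 2>/dev/null; then
    echo "pc.genes=19062 was treated as a command, not an assignment"
fi

# Underscores (or no separator at all) give valid variable names.
pc_genes=19062
avg_exons=5.96732

# The dotted names can still be used as column labels, since they are only text there.
printf '%s\t%s\t%s\n' ref.file pc.genes avg.exons
printf '%s\t%s\t%s\n' sample1.txt "$pc_genes" "$avg_exons"
```

The dots only matter in the variable names; the header row can keep the original parameter names.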