How to build a tab-delimited text file with many calculated values?

Question

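Background

I am calculating or retrieving data for many different parameters (pc.genes, pc.transcripts, pc.genes.antisense, etc.) from a set of similar reference files (sample1.txt, sample2.txt, etc.).

Simplified example of a single ref.file (e.g., sample1.txt):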

word1   word2   word3   405438   409170   .   Y   .   word4; word5
word1   word2   word3   405438   409170   .   N   .   word4; word5
word1   word2   word3   409006   409170   .   N   .   word4; word5
word1   word2   word3   405438   408401   .   Y   .   word4; word5
word1   word2   word3   407099   408361   .   N   0   word4; word5

The calculation for the "avg.exons" parameter might look like this:

$ awk '$3 == "word3"' sample1.txt | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}'
5.96732

The retrieval of the "pc.genes" parameter might look like this:

$ awk '$3 == "word3"' sample1.txt | grep -c "word4"
19062

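These are just examples, in case the solution requires piping the commands into a function that then transfers/adds them to the table. The output of each of these commands is always a single number.

Desired output

I would like to place these calculated/retrieved values in an organized table format (preferably a tab-delimited text file) so that I can generate graphs from the data: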

ref.file    pc.genes    pc.transcripts  pc.genes.antisense  pc.genes.sense  avg.exons   avg.genelength
sample1.txt 19062   116573  2585576 1318321 5.96732 3732.57
sample2.txt 19753   138563  5834759 1433785 5.84654 4023.89
sample3.txt 19376   124576  2871235 1983263 6.78929 3890.32

Is this possible? If so, how would I go about it?

Attempt

for file in sample*.txt
do
    printf "%s\n" ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength | paste -sd $'\t'
    pc.genes=$(awk '$3 == "word3"' ${file} | grep -c "word4")
    avg.exons=$(awk '$3 == "word3"' ${file} | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
    ... # get rest of desired values
done > table.txt

Errors produced

-bash: pc.genes=19062: command not found
... # other errors with corresponding CORRECT value outputs
-bash: avg.exons=5.96732: command not found
... # the errors even continue into the other sample*.txt files, which is good
-bash: pc.genes=19753: command not found
...

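All of the values corresponding to the given parameters (i.e., the "=###" part) are correct, but the errors prevent them from being placed in the table.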

Answer 1

Based only on the details provided by OP, and assuming a looping construct that processes one file at a time, e.g.:

# print header
printf "ref.file\tpc.genes\tpc.transcripts\tpc.genes.antisense\tpc.genes.sense\tavg.exons\tavg.genelength\n"

while read -r fn
do
    aexons=$(awk '$3 == "word1"' "${fn}" | sed -n 's/.*word2 \([^;]*\).*word3 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
    pgenes=$(awk '$3 == "word1"' "${fn}" | grep -c "word2")
    ... # get rest of desired values

    # print tab-delimited output to stdout; adjust formats as needed
    printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${fn}" "${pgenes}" .... "${aexons}" ...

done < <('ls' sample*.txt)  # replace with whatever logic OP is using to find desired files

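While the above should work, it is not very efficient given all the subprocess calls ($(...); piped commands), and it has to process each input file (${fn}) 6 times (once for each of the 6 values).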

A more efficient approach would be to process each input file (${fn}) only once.
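One way to read this suggestion is sketched below: a single awk pass per file prints every value on one tab-separated line, so the file is read once instead of once per value. The function names `summarize_file` and `print_row` are made up for illustration, and the second metric (counting rows whose 7th field is "Y") is a hypothetical stand-in, since the question does not show the remaining calculations.

```shell
# Sketch only: one awk pass per file instead of one pipeline per value.
summarize_file() {
    awk -v OFS='\t' '
        $3 == "word3" {
            if ($0 ~ /word4/) genes++      # same test as the pc.genes pipeline
            if ($7 == "Y")    flagged++    # hypothetical second metric
        }
        END { print genes + 0, flagged + 0 }
    ' "$1"
}

print_row() {
    # awk already emits tab-separated numbers, so they can be appended as-is
    printf '%s\t%s\n' "$1" "$(summarize_file "$1")"
}
```

Inside the loop, `print_row "${fn}"` would then replace the separate `$(...)` assignments and the final `printf`.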

An additional step might be to eliminate the loop in favor of a single program that processes all of the files in one pass.

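Since awk is capable of parsing data (from multiple files), calculating sums/averages, and generating (tab-delimited) output, I would probably lean toward a single awk command/invocation as the more efficient solution ... but without sample data and more details on the required calculations it is not possible to say for sure.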
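As a rough sketch of what such a single invocation could look like, the function below keeps per-file totals in arrays keyed by awk's built-in `FILENAME` and prints the header plus one row per file at the end. The name `build_table` and the `avg.span` column (mean of field 5 minus field 4 plus 1) are assumptions for illustration only; the real calculations would have to be substituted.

```shell
# Sketch only: build the whole table with one awk invocation over all files.
build_table() {
    awk '
        BEGIN { OFS = "\t"; print "ref.file", "pc.genes", "avg.span" }

        # remember each file the first time one of its lines is seen
        !(FILENAME in seen) { seen[FILENAME] = 1; order[++nfiles] = FILENAME }

        $3 == "word3" {
            if ($0 ~ /word4/) genes[FILENAME]++    # mirrors the pc.genes pipeline
            span[FILENAME] += $5 - $4 + 1          # hypothetical metric
            rows[FILENAME]++
        }

        END {
            for (i = 1; i <= nfiles; i++) {
                f = order[i]
                avg = rows[f] ? span[f] / rows[f] : 0
                printf "%s\t%d\t%.2f\n", f, genes[f], avg
            }
        }
    ' "$@"
}
```

It could be called as `build_table sample*.txt > table.txt`. Only POSIX awk features are used here; with GNU awk, an `ENDFILE` block would be an alternative to the per-file arrays.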

Answer 2

The following answer works flawlessly and combines the suggestions from markp-fuso and KamilCuk. Thank you both!

# add the table headers
printf "%s\n" ref.file pc.genes pc.transcripts pc.genes.antisense pc.genes.sense avg.exons avg.genelength | paste -sd $'\t'

for file in sample*.txt
do
# create variables containing code for all parameter calculations/retrievals
pcgenes=$(awk '$3 == "word3"' "${file}" | grep -c "word4")
pctranscripts=$(...)
pcgenesantisense=$(...)
pcgenessense=$(...)
avgexons=$(awk '$3 == "word3"' "${file}" | sed -n 's/.*word4 \([^;]*\).*word5 \([^;]*\).*/\1;\2/p' | awk -F';' '{a[$1]++}END{for (i in a) {count++; sum += a[i]} print sum/count}')
avggenelength=$(...)

# print all resulting values in a single tab separated row of the table
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${file}" "${pcgenes}" "${pctranscripts}" "${pcgenesantisense}" "${pcgenessense}" "${avgexons}" "${avggenelength}"
done > table.txt
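
The reason the original attempt failed while the renamed variables work appears to be the variable names themselves. Shell variable names may contain only letters, digits, and underscores, and may not start with a digit, so `pc.genes=19062` is treated as a command name rather than an assignment. The helper `is_valid_name` below is a hypothetical illustration of that rule and is not part of either answer.

```shell
# Sketch: check whether a string is usable as a shell variable name.
is_valid_name() {
    case $1 in
        '' | [0-9]* | *[!A-Za-z0-9_]*) return 1 ;;   # empty, leading digit, or illegal character
        *) return 0 ;;
    esac
}

for name in pc.genes pcgenes pc_genes; do
    if is_valid_name "$name"; then
        echo "$name: valid"
    else
        echo "$name: invalid"
    fi
done
```

Dropping the dot (`pcgenes`) or replacing it with an underscore (`pc_genes`) both give valid names; the dotted form can still be used freely in the printed header.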

bash

text-processing

dataframe
