Print Filename and Substring to csv For Each File in a Directory
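I have been trying to teach myself awk to accomplish the following task, with little success.
I have a directory containing multiple text files: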
JV-01_S01_L007_R2_002_RepetitiveText_ToRemove.txt
JV-26_S48_L_RepetitiveText_ToRemove.txt
...
Each text file is structured as follows. The numbers may change, but the accompanying text always stays the same.
JV-01_S01_L007_R2_002_RepetitiveText_ToRemove.txt
4620178 reads; of these:
4620178 (100.00%) were unpaired; of these:
1226814 (26.55%) aligned 0 times
3040861 (65.82%) aligned exactly 1 time
352503 (7.63%) aligned >1 times
73.45% overall alignment rate
JV-26_S48_L_RepetitiveText_ToRemove.txt
1601831 reads; of these:
1601831 (100.00%) were unpaired; of these:
58800 (3.67%) aligned 0 times
1344724 (83.95%) aligned exactly 1 time
198307 (12.38%) aligned >1 times
96.33% overall alignment rate
For each file in this directory, I would like to compile a csv file:
Sample Total_Reads Uniquely_Mapped_Reads Multi_Mapped_Reads Unmapped_Reads
JV-01_S01_L007_R2_002 4620178 3040861 352503 1226814
JV-26_S48_L 1601831 1344724 198307 58800
...
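Is there a way to do this with a single for loop using awk? I have been trying to use the match function.
For example, if I could restrict the match search to a particular line and then scan from left to right for a substring made up of digits until a space is found, that would grab the substring of interest on that line.
Something along these lines: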
for file in *.txt
do
awk 'FNR == 1 {print FILENAME, match(NR==1, \d), match(NR==4, \d), match(NR==5, \d), match(NR==3, \d) } ' $file >> Names.csv
Could you please try the following, written and tested with the shown samples.
awk '
BEGIN{
  print "Sample Total_Reads Uniquely_Mapped_Reads Multi_Mapped_Reads Unmapped_Reads"
}
FNR==1{
  # A new file starts: print the row collected for the previous file.
  if(total_reads){
    print file,total_reads,Uniquely_Mapped_Reads,Multi_Mapped_Reads,Unmapped_Reads
  }
  total_reads=Uniquely_Mapped_Reads=Multi_Mapped_Reads=Unmapped_Reads=""
  # Work on a copy of FILENAME and strip the repetitive suffix.
  file=FILENAME
  sub(/_RepetitiveText.*/,"",file)
}
/reads; of these/{
  total_reads=$1
  next
}
/aligned exactly 1 time/{
  Uniquely_Mapped_Reads=$1
  next
}
/aligned >1 times/{
  Multi_Mapped_Reads=$1
  next
}
/aligned [0-9]+ times/{
  Unmapped_Reads=$1
}
END{
  # Print the row for the last file.
  if(total_reads){
    print file,total_reads,Uniquely_Mapped_Reads,Multi_Mapped_Reads,Unmapped_Reads
  }
}
' *.txt | column -t
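Here is a simple approach, but it requires GNU awk for the multi-character RS.
You can use the RS="^$" trick to read the whole file as a single record. Then you just need to print the fields you want (this does depend on your assertion that the text is fixed).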
$ awk -v RS="^$" '{print FILENAME, $1, $16, $22, $11}' jv-01 jv-26
jv-01 4620178 3040861 352503 1226814
jv-26 1601831 1344724 198307 58800
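If GNU awk is not available, a similar table can probably be built with POSIX awk alone. The following is only a sketch, not the method of either answer above: it uses match() to pick up the first run of digits on each line, as the question suggests, and writes real comma-separated output. The function name summarize_reads and the variable names are made up for illustration, and it assumes the log text is fixed exactly as shown in the question.

```shell
# summarize_reads is a hypothetical helper name, not part of any tool.
# Usage: summarize_reads FILE... ; prints one CSV row per input file.
summarize_reads() {
    awk '
    # Print the row collected for the previous file, if there is one.
    function emit_row() {
        if (sample != "") print sample, total, unique, multi, unmapped
    }
    BEGIN {
        OFS = ","
        print "Sample", "Total_Reads", "Uniquely_Mapped_Reads", "Multi_Mapped_Reads", "Unmapped_Reads"
    }
    FNR == 1 {
        emit_row()
        sample = FILENAME
        sub(/^.*\//, "", sample)               # drop any directory part
        sub(/_RepetitiveText.*$/, "", sample)  # drop the repetitive suffix
        total = unique = multi = unmapped = ""
    }
    # match() sets RSTART and RLENGTH to the first run of digits on the line.
    match($0, /[0-9]+/) {
        n = substr($0, RSTART, RLENGTH)
        if      ($0 ~ /reads; of these/)        total    = n
        else if ($0 ~ /aligned exactly 1 time/) unique   = n
        else if ($0 ~ /aligned >1 times/)       multi    = n
        else if ($0 ~ /aligned 0 times/)        unmapped = n
    }
    END { emit_row() }
    ' "$@"
}
```

It could then be called as summarize_reads *.txt > Names.csv, which replaces the shell for loop with a single awk invocation.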