Ubuntu: How do I extract only specific columns from lines of a tab-delimited file that contain a specific string?
I want to extract the lines containing chr6.fa from the file FC305JN_s_1_eland_result.txt.
Then, from that subset, I only want columns 1, 2, 7, 8 and 9.
grep -E "chr6.fa" FC305JN_s_1_eland_result.txt > out.txt
awk -F, '{OFS=",";print $1,$2,$7,$8,$9}' out.txt > outfile.txt
My out.txt is exactly the same as outfile.txt.
A small sample of FC305JN_s_1_eland_result.txt:
>FC305JN_20080525:1:15:1412:166 GTGAATCCTTATTCCGATATATATNNNN U0 1 0 0 chrX.fa 45974622 R ..
>FC305JN_20080525:1:15:944:72 GATGACTTCCTTAATTTTCTTTATNNNN U0 1 0 0 chr6.fa 7200804 R ..
>FC305JN_20080525:1:15:1049:473 GAATGGCAACACAAACAGGGCTGANNNN R2 0 0 4
>FC305JN_20080525:1:15:1196:1959 GGGAGAAGCCTCCCCGCCTCGGCCNNNN U2 0 0 1 chr17.fa 38386704 F .. 17A 23T
>FC305JN_20080525:1:15:1034:505 GAAAATGTTTCAAATCAATTTCTANNNN U0 1 0 0 chr2.fa 183305566 R ..
>FC305JN_20080525:1:15:983:126 GGATAGAGAGTTTGCACTGAGTTGNNNN U0 1 0 0 chrX.fa 92367529 F ..
>FC305JN_20080525:1:15:1799:100 TTCAGCTTATTGATAAAGAAGCACNNNN U0 1 0 0 chr6.fa 20979453 R ..
>FC305JN_20080525:1:15:743:1028 GAATGGAATGGAATGGAAAGAAACNNNN R1 0 33 255
>FC305JN_20080525:1:15:771:1076 GAGTTCACTAAACAAAAGAGTGTCNNNN U2 0 0 1 chr6.fa 136877852 R .. 7A 13G
Current output, outfile.txt (sample):
>FC305JN_20080525:1:15:944:72 GATGACTTCCTTAATTTTCTTTATNNNN U0 1 0 0 chr6.fa 7200804 R ..
>FC305JN_20080525:1:15:1799:100 TTCAGCTTATTGATAAAGAAGCACNNNN U0 1 0 0 chr6.fa 20979453 R ..
>FC305JN_20080525:1:15:771:1076 GAGTTCACTAAACAAAAGAGTGTCNNNN U2 0 0 1 chr6.fa 136877852 R .. 7A 13G
Desired output (sample):
>FC305JN_20080525:1:15:944:72 GATGACTTCCTTAATTTTCTTTATNNNN chr6.fa 7200804 R
>FC305JN_20080525:1:15:1799:100 TTCAGCTTATTGATAAAGAAGCACNNNN chr6.fa 20979453 R
>FC305JN_20080525:1:15:771:1076 GAGTTCACTAAACAAAAGAGTGTCNNNN chr6.fa 136877852 R
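A likely reason the attempt above does not select any columns: `-F,` tells awk to split on commas, but the file is tab-delimited, so each line is a single field and `$1` is the whole line. A minimal sketch to check this, assuming the columns are separated by real tab characters (the `line` variable is only an illustrative copy of one sample row):

```shell
# One sample row with real tab separators (printf expands \t in its format).
line=$(printf '>FC305JN_20080525:1:15:944:72\tGATGACTTCCTTAATTTTCTTTATNNNN\tU0\t1\t0\t0\tchr6.fa\t7200804\tR\t..')

# Splitting on commas: the row contains no commas, so awk sees one field.
comma_fields=$(printf '%s\n' "$line" | awk -F, '{print NF}')

# Splitting on tabs: every column becomes its own field.
tab_fields=$(printf '%s\n' "$line" | awk -F'\t' '{print NF}')

echo "fields when splitting on commas: $comma_fields"
echo "fields when splitting on tabs: $tab_fields"
```

Running the same `NF` check on the real file is a quick way to confirm which delimiter it actually uses.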
Simplifying your code (using borrowed code):
grep -E "chr6.fa" FC305JN_s_1_eland_result.txt > out.txt
awk -v OFS='\t' '{print $1, $2, $7, $8, $9}' out.txt > outfile.txt
Produces the output:
>FC305JN_20080525:1:15:944:72 GATGACTTCCTTAATTTTCTTTATNNNN chr6.fa 7200804 R
>FC305JN_20080525:1:15:1799:100 TTCAGCTTATTGATAAAGAAGCACNNNN chr6.fa 20979453 R
>FC305JN_20080525:1:15:771:1076 GAGTTCACTAAACAAAAGAGTGTCNNNN chr6.fa 136877852 R
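The same two-step idea can also be sketched without awk, since `cut` splits on tabs by default. This is an alternative, not what the answer above uses; it assumes real tab separators, and the `sample` and `picked` variables are illustrative names only:

```shell
# Two sample rows with real tabs; printf repeats the format for each group of 5 arguments.
sample=$(printf '%s\t%s\tU0\t1\t0\t0\t%s\t%s\t%s\t..\n' \
    '>FC305JN_20080525:1:15:1412:166' 'GTGAATCCTTATTCCGATATATATNNNN' 'chrX.fa' '45974622' 'R' \
    '>FC305JN_20080525:1:15:944:72' 'GATGACTTCCTTAATTTTCTTTATNNNN' 'chr6.fa' '7200804' 'R')

# grep -F treats chr6.fa as a literal string (the dot is not a wildcard);
# cut splits on tabs by default and keeps the listed fields in file order.
picked=$(printf '%s\n' "$sample" | grep -F 'chr6.fa' | cut -f1,2,7,8,9)
printf '%s\n' "$picked"
```

Note that `grep -F` matches the string anywhere on the line, not only in column 7, which is fine for this sample but may be too loose for other data.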
Entirely in awk:
$ awk 'BEGIN {
    FS=OFS="\t"                  # set correct delimiters
}
$7 ~ /chr6\.fa/ {                # replaces the grep part
    print $1, $2, $7, $8, $9     # output
}' file                          # your file goes here
Output:
>FC305JN_20080525:1:15:944:72 GATGACTTCCTTAATTTTCTTTATNNNN chr6.fa 7200804 R
>FC305JN_20080525:1:15:1799:100 TTCAGCTTATTGATAAAGAAGCACNNNN chr6.fa 20979453 R
>FC305JN_20080525:1:15:771:1076 GAGTTCACTAAACAAAAGAGTGTCNNNN chr6.fa 136877852 R
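If a regex match on column 7 feels too loose, an exact string comparison is one possible variant. The sketch below assumes real tab separators; `sample.txt`, the temporary directory, and the `chrom` variable are hypothetical names chosen for the demo, not part of the original question:

```shell
# Build a small tab-delimited sample file in a temporary directory.
tmp=$(mktemp -d)
printf '%s\t%s\tU0\t1\t0\t0\t%s\t%s\t%s\t..\n' \
    '>FC305JN_20080525:1:15:1412:166' 'GTGAATCCTTATTCCGATATATATNNNN' 'chrX.fa' '45974622' 'R' \
    '>FC305JN_20080525:1:15:944:72' 'GATGACTTCCTTAATTTTCTTTATNNNN' 'chr6.fa' '7200804' 'R' \
    > "$tmp/sample.txt"
# A short row with only six fields, like the R2 line in the sample data.
printf '>FC305JN_20080525:1:15:1049:473\tGAATGGCAACACAAACAGGGCTGANNNN\tR2\t0\t0\t4\n' >> "$tmp/sample.txt"

# Keep rows whose 7th field is exactly the wanted name, printing five columns.
# Rows with fewer than seven fields have an empty $7 and are skipped.
result=$(awk -F'\t' -v OFS='\t' -v chrom='chr6.fa' \
    '$7 == chrom { print $1, $2, $7, $8, $9 }' "$tmp/sample.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Passing the name through `-v chrom=...` avoids having to escape the dot, and makes it easy to reuse the command for another chromosome.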