构建 Bowtie 索引失败（tophat2、bowtie2）

Question

（注意：标签应该是tophat2和bowtie2，但我没有创建新标签的积分）

您好：我正在使用 Tophat2（命令行）分析 RNA-seq 数据，但遇到了一些错误。

调用如下：

tophat2 -o tophat2_results/ -G ref_data/BA000007.2.gtf --transcriptome-index=transcriptome_data/RNA_LBG01b_241_filteredQ indices/BA000007.2 data_files/RNA_LBG01b_241_filteredQ.fastq

错误如下：

[2015-12-29 12:58:33] Checking for Bowtie
          Bowtie version:     2.2.4.0
[2015-12-29 12:58:33] Checking for Bowtie index files (genome)..
[2015-12-29 12:58:33] Checking for reference FASTA file
[2015-12-29 12:58:33] Generating SAM header for indices/BA000007.2
[2015-12-29 12:58:33] Reading known junctions from GTF file
    Warning: TopHat did not find any junctions in GTF file
[2015-12-29 12:58:33] Preparing reads
     left reads: min. length=12, max. length=342, 202732 kept reads (1315 discarded)
Warning: short reads (<20bp) will make TopHat quite slow and take large amount of memory because they are likely to be mapped in too many places
[2015-12-29 12:58:39] Building transcriptome data files transcriptome_data/RNA_LBG01b_241_filteredQ
[2015-12-29 12:58:40] Building Bowtie index from RNA_LBG01b_241_filteredQ.fa
    [FAILED]
Error: Couldn't build bowtie index with err = 1

版本信息： TopHat v2.1.0 Bowtie2 版本 2.2.4 Python 2.7.10 :: Anaconda 2.4.0（64 位）

系统信息： CentOS 6.7 版

我是如何来到这里的以及我尝试了什么：

我正在使用大肠杆菌（登录号：BA000007.2）作为我的参考基因组，可在此处找到：http://www.ncbi.nlm.nih.gov/nuccore/BA000007.2

我从 Ensemble (ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//gtf/bacteria_9_collection/escherichia_coli_o157_h7_str_sakai/) 获得了我的 GTF 文件

我使用 bowtie2-build 制作了我的索引（在调用 tophat2 之前）

bowtie2-build -f ref_data/BA000007.2.fasta indices/BA000007.2

我知道我收到的错误与 *.gtf 文件第一列中出现的不同名称和参考 fasta 文件的名称有关。如果我理解正确，第一列中的每个条目都应该是 BA000007.2，其中第一列中的大多数名称 "Chromosome"。为了解决这个问题，我做了以下事情：

awk '{FS=OFS="\t"}{print "BA000007.2", , , , , , , , }' pathToGTF/BA000007.2_ensemble.gtf > pathToGTF/BA000007.2.gtf

#Please note the commented build information (e.g., #!genome-build ASM80120v1) at the beginning of ensemble gtf file would create unerous output from the awk command has been addressed

我还把fasta文件的终止符从*.fasta改成了*.fa

问题：

我是否正确地解决了 gtf 文件的第一列和 fasta 文件的名称之间的命名差异引起的任何问题 (BA000007.2, BA000007.2.fa)?
当我细读日志目录中的输出时，有几个错误（g2f.err 和 ftf_juncs.log 中的类似错误），行开头为：

警告：以下行的起始坐标无效： BA000007.2 ena 基因 -194 2502 . + 。 gene_id "BAA31757"; gene_version "1"; gene_name "tagA"; gene_source "ena"; gene_biotype "protein_coding";

gtf文件中确实有负数，但genbank文件中没有（在vim中快速搜索）。这可能是错误的根源吗？我注释掉了特定的行并将它们从文件中删除——这两种方法仍然会导致错误。

是否有任何容易看到的东西可能导致“无法使用 err = 1 构建领结索引”错误？

我已经坚持了几天，所以非常感谢任何帮助。

Answer 1

我找到了问题的根源。它是参考 fasta 文件中的 header。最初的 header 是：

>gi|47118301|dbj|BA000007.2| Escherichia coli O157:H7 str. Sakai DNA, complete genome

应该在哪里

>BA000007

所以...如果fasta文件名为abc123.fa，那么fasta文件中的header一定是>abc123。 gtf文件的第一列也必须是abc123.

请注意，我在所有调用中将基数从 BA000007.2 更改为 BA000007，并且我重命名了名称中没有 .2 的所有文件。它可能仍然适用于 .2，但我没有对其进行测试（“基本名称是任何索引文件的名称，但不包括第一个句点。”[顶帽手册]）（谢谢AM）。最后，我将 fasta 文件从 *.fasta 重命名为 *.fa。

构建 Bowtie 索引失败（tophat2、bowtie2）

Building Bowtie index failure (tophat2, bowtie2)

bioinformatics