The glob.glob function to extract data from files

Question

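I am trying to run the script below. Its purpose is to open a series of FASTA files one after another and extract the geneID from each. The script works fine if I do not use the glob.glob function. With it, I get this message: TypeError: coercing to Unicode: need string or buffer, list found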

files='/home/pathtofiles/files'
    #print files
    #sys.exit()
    for file in files:
        fastas=sorted(glob.glob(files + '/*.fasta'))
        #print fastas[0]
        output_handle=(open(fastas, 'r+'))
        genes_files=list(SeqIO.parse(output_handle, 'fasta'))
        geneID=genes_files[0].id
        print geneID

I am running out of ideas on how to make the script open the files one after another and give me the information I need.

Answer 1

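I see what you are trying to do, but let me first explain why your current approach does not work.

You have the path to a directory containing FASTA files, and you want to loop over the files in that directory. But observe what happens if we do this: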

>>> files='/home/pathtofiles/files'
>>> for file in files:
...     print file
...
/
h
o
m
e
/
p
a
t
h
t
o
f
i
l
e
s
/
f
i
l
e
s

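Not the list of file names you expected! files is a string, and when you apply a for loop to a string, you simply iterate over the characters in that string.

Additionally, as doctorlove correctly observed, in your code fastas is a list, whereas open expects a file path as its first argument. That is why you get TypeError: ... need string, ... list found.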
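
A minimal sketch of that failure and the usual fix. It assumes a temporary directory with two made-up files (a.fasta and b.fasta) in place of your real data, and it uses parenthesised print so it also runs on Python 3, where the TypeError message is worded differently:

```python
import glob
import os
import shutil
import tempfile

# Hypothetical setup: a scratch directory holding two tiny .fasta files.
tmp_dir = tempfile.mkdtemp()
for name in ('b.fasta', 'a.fasta'):
    with open(os.path.join(tmp_dir, name), 'w') as handle:
        handle.write('>%s\nACGT\n' % name)

fastas = sorted(glob.glob(os.path.join(tmp_dir, '*.fasta')))

# glob.glob returns a list, so passing it straight to open() fails.
try:
    open(fastas, 'r')
    failed = False
except TypeError:
    failed = True

# Opening each path in turn works.
first_lines = []
for fasta_path in fastas:
    with open(fasta_path, 'r') as handle:
        first_lines.append(handle.readline().strip())

print(failed)
print(first_lines)

# Clean up the scratch directory.
shutil.rmtree(tmp_dir)
```

The point is only that the loop has to run over the list that glob.glob returns, not over the directory string.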

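As an aside (this is more of a problem on Windows than on Linux or Mac), it is good practice to always use raw string literals (prefix the string with r) when working with path names. This prevents backslash escape sequences such as \n and \t from being unintentionally expanded into newlines and tabs.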

>>> path = 'C:\Users\norah\temp'
>>> print path
C:\Users
orah    emp
>>> path = r'C:\Users\norah\temp'
>>> print path
C:\Users\norah\temp

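Another good practice is to use os.path.join() when combining path names and file names. This prevents subtle bugs where the script works on your machine but fails on the machine of a colleague who uses a different operating system.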
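
As a rough illustration, the sketch below builds a glob pattern with os.path.join() and checks it against the separator of whatever platform it runs on. The directory components are placeholders, not your real path:

```python
import os

# Hypothetical relative directory; substitute your own components.
directory = os.path.join('home', 'pathtofiles', 'files')
pattern = os.path.join(directory, '*.fasta')

# os.sep is the path separator of the platform the script runs on,
# so the same code yields '/' on Linux/Mac and '\\' on Windows.
expected = os.sep.join(['home', 'pathtofiles', 'files', '*.fasta'])

print(pattern)
print(pattern == expected)
```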

I would also recommend using the with statement when opening files. This ensures that the file handle is properly closed once you are done with it.
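
A small sketch of what the with statement gives you, assuming a scratch file with a made-up record (gene1) in place of a real FASTA file. The handle is open inside the block and closed as soon as the block ends, even if an exception is raised inside it:

```python
import os
import tempfile

# Hypothetical scratch file standing in for one of your FASTA files.
fd, scratch_path = tempfile.mkstemp(suffix='.fasta')
os.close(fd)

with open(scratch_path, 'w') as handle:
    handle.write('>gene1\nACGT\n')

with open(scratch_path, 'r') as handle:
    header = handle.readline().strip()
    open_inside = not handle.closed  # still open within the block

closed_after = handle.closed  # closed automatically on leaving the block

print(header)
print(open_inside)
print(closed_after)

# Remove the scratch file.
os.remove(scratch_path)
```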

As a final remark, file is a built-in function in Python, and it is bad practice to use a variable with the same name as a built-in, because this can cause bugs or confusion later on.

Putting all of the above together, I would rewrite your code like this:
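
The same hazard applies to any built-in name. The sketch below uses list rather than file, because file exists as a built-in only in Python 2. The file names are made up for illustration:

```python
files = ['b.fasta', 'a.fasta']  # hypothetical file names

# Binding the name 'list' at module level hides the built-in list type.
list = sorted(files)

try:
    list('abc')  # now tries to call a list object, not the built-in
    shadowing_broke_call = False
except TypeError:
    shadowing_broke_call = True

# Deleting the module-level name makes the built-in visible again.
del list
restored = list('ab')

print(shadowing_broke_call)
print(restored)
```

Choosing a descriptive name such as fasta_path avoids the problem entirely.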

import os
import glob
from Bio import SeqIO

path = r'/home/pathtofiles/files'
pattern = os.path.join(path, '*.fasta')
for fasta_path in sorted(glob.glob(pattern)):
    print fasta_path
    with open(fasta_path, 'r+') as output_handle:
        genes_records = SeqIO.parse(output_handle, 'fasta')
        for gene_record in genes_records:
            print gene_record.id

Answer 2

This is how I solved the problem; this script works.

import os,sys
import glob
from Bio import SeqIO

def extracting_information_gene_id():
    #to extract geneID information and add the reference gene to each different file

    files=sorted(glob.glob('/home/path_to_files/files/*.fasta'))
    #print file
    #sys.exit()
    for file in files:
        #print file
        output_handle=open(file, 'r+')
        ref_genes=list(SeqIO.parse(output_handle, 'fasta'))
        geneID=ref_genes[0].id
        #print geneID
        #sys.exit()

        #to extract the geneID as a reference record from the genes_files
        query_genes=(SeqIO.index('/home/path_to_file/file.fa', 'fasta'))
        #print query_genes[geneID].format('fasta') #check point
        #sys.exit()
        ref_gene=query_genes[geneID].format('fasta')
        #print ref_gene #check point
        #sys.exit()
        output_handle.write(str(ref_gene))
        output_handle.close()
        query_genes.close()

extracting_information_gene_id()
print 'Reference gene sequence have been added'

biopython

python-2.7