Is there a faster way to find matching features in two arrays (Python)?
I'm trying to iterate over every feature in one file (1 per line) and find all matching features in a second file, based on one column of that line. I have a solution that does what I want on small files, but it is very slow on big files (my files have >20,000,000 lines). Here's a sample of the two input files.
My (slow) code:
FEATUREFILE = 'S2_STARRseq_rep1_vsControl_peaks.bed'
CONSERVATIONFILEDIR = './conservation/'

with open(FEATUREFILE, 'r') as peakFile, open('featureConservation.td', 'w+') as outfile:
    for line in peakFile:
        fields = line.split('\t')
        chrom = fields[0]
        startPos = int(fields[1])
        endPos = int(fields[2])
        peakName = fields[3]
        enrichVal = float(fields[4])
        # Reject negative peak starts, if they exist (sometimes this can happen w/ MACS)
        if startPos > 0:
            with open(CONSERVATIONFILEDIR + chrom + '.bed', 'r') as conservationFile:
                cumulConserv = 0.
                n = 0
                for conservLine in conservationFile:
                    conservFields = conservLine.split('\t')
                    position = int(conservFields[1])
                    conservScore = float(conservFields[3])
                    if startPos <= position <= endPos:
                        cumulConserv += conservScore
                        n += 1
            featureConservation = cumulConserv / n
            outfile.write('\t'.join([chrom, str(startPos), str(endPos), peakName,
                                     str(enrichVal), str(featureConservation)]) + '\n')
First of all, every time you read one line from peakFile you iterate over the entire contents of conservationFile, so adding a break after the n += 1 inside the if statement should help, assuming there is only one match per feature.
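A minimal, self-contained sketch of that change, using toy lines in place of the real conservation file (the single-match assumption is this answer's, not something the BED format guarantees):

```python
# Toy stand-ins for conservationFile lines: chrom, position, end, score
lines = ['chr1\t5\t6\t0.2', 'chr1\t15\t16\t0.9', 'chr1\t18\t19\t0.4']
startPos, endPos = 10, 20

cumulConserv, n = 0.0, 0
for conservLine in lines:
    fields = conservLine.split('\t')
    position = int(fields[1])
    if startPos <= position <= endPos:
        cumulConserv += float(fields[3])
        n += 1
        break  # stop scanning once the (assumed single) match is found

print(cumulConserv / n)  # prints 0.9
```

With the break, each peak stops scanning at its first hit instead of reading the rest of the chromosome file.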
Another option is to try mmap, which may help with buffering.
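For reference, a stdlib-only mmap sketch on a small throwaway file (the file name, window, and values are invented for illustration, not the poster's data):

```python
import mmap

# Write a tiny throwaway file so the sketch is self-contained
with open('demo.bed', 'wb') as f:
    f.write(b'chr1\t5\t6\t0.3\nchr1\t15\t16\t0.7\n')

startPos, endPos = 10, 20
total = 0.0
with open('demo.bed', 'rb') as f:
    # Map the file into memory; the OS pages it in on demand, which can
    # beat repeated buffered readlines() calls on very large files.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for line in iter(mm.readline, b''):
            fields = line.decode().rstrip('\n').split('\t')
            if startPos <= int(fields[1]) <= endPos:
                total += float(fields[3])

print(total)  # prints 0.7
```

Whether mmap actually wins here depends on the OS page cache and access pattern; it mainly avoids copying the file through Python's buffered I/O layer.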
Bedtools was made for exactly this, specifically the intersect function:
http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
The best solution for me turned out to be rewriting the code above for pandas. Here is what worked well for me on some very large files:
import pandas as pd

FEATUREFILE = 'S2_STARRseq_rep1_vsControl_peaks.bed'
CONSERVATIONFILEDIR = './conservation/'

peakDF = pd.read_csv(FEATUREFILE, sep='\t', header=None,
                     names=['chrom', 'start', 'end', 'name', 'enrichmentVal'])
# Reject negative peak starts, if they exist (sometimes this can happen w/ MACS)
peakDF.drop(peakDF[peakDF.start <= 0].index, inplace=True)
peakDF.reset_index(drop=True, inplace=True)
peakDF['conservation'] = 1.0  # placeholder, overwritten below

chromNames = peakDF.chrom.unique()
for chromosome in chromNames:
    chromSubset = peakDF[peakDF.chrom == chromosome]
    chromDF = pd.read_csv(CONSERVATIONFILEDIR + str(chromosome) + '.bed', sep='\t',
                          header=None, names=['chrom', 'start', 'end', 'conserveScore'])
    for idx in chromSubset.index:
        x = chromDF[chromDF.start >= chromSubset['start'][idx]]
        featureSubset = x[x.start < chromSubset['end'][idx]]
        featureConservation = float(featureSubset.conserveScore.sum()
                                    / (chromSubset['end'][idx] - chromSubset['start'][idx]))
        peakDF.at[idx, 'conservation'] = featureConservation

peakDF.to_csv("featureConservation.td", sep='\t')
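For completeness, newer pandas can remove the inner Python loop entirely with an IntervalIndex, which maps every scored position to its containing peak in one vectorized pass. A sketch on synthetic data (the toy values are invented for illustration); note it averages over matched positions, as in the question's original code, and assumes peaks on a chromosome do not overlap:

```python
import pandas as pd

# Hypothetical toy stand-ins for the peak file and one chromosome's conservation file
peaks = pd.DataFrame({'chrom': ['chr1', 'chr1'],
                      'start': [10, 100],
                      'end':   [20, 110],
                      'name':  ['peak1', 'peak2']})
scores = pd.DataFrame({'pos':   [12, 15, 105],
                       'score': [0.5, 1.0, 0.8]})

# Build half-open [start, end) intervals and find, for each scored
# position, the index of the peak containing it (-1 = no peak).
iv = pd.IntervalIndex.from_arrays(peaks['start'], peaks['end'], closed='left')
scores['peak'] = iv.get_indexer(scores['pos'])

# Mean score per peak, aligned back onto the peaks table
means = scores[scores['peak'] >= 0].groupby('peak')['score'].mean()
peaks['conservation'] = peaks.index.map(means)

print(peaks)
```

This trades the per-peak filtering for a single interval lookup per position, which scales much better when both tables are large.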