Find matches in two files using Python
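I am analysing sequencing data and have a few candidate genes whose functions I need to find.
After editing the available human database, I want to compare my candidate genes against the database and output the function of each candidate gene.
I only have basic Python skills, so I thought a script might help me speed up the search for my candidate genes' functions.
File 1, which contains the candidate genes, looks like this: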
Gene
AQP7
RLIM
SMCO3
COASY
HSPA6
and the database, file2.csv, looks like this:
Gene function
PDCD6 Programmed cell death protein 6
CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a
CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a
CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a
CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a
Desired output:
Gene(from file1) ,function(matching from file2)
I tried using this code:
file1 = 'file1.csv'
file2 = 'file2.csv'
output = 'file3.txt'
with open(file1) as inf:
    match = set(line.strip() for line in inf)
with open(file2) as inf, open(output, 'w') as outf:
    for line in inf:
        if line.split(' ',1)[0] in match:
            outf.write(line)
All I get is an empty output file.
I also tried using the intersection function:
with open('file1.csv', 'r') as ref:
    with open('file2.csv','r') as com:
        with open('common_genes_function','w') as output:
            same = set(ref).intersection(com)
            print same
That does not work either.
Please help, otherwise I will need to do this manually.
I'd recommend using the pandas merge function. However, it needs a clear separator between the 'Gene' and 'function' columns. In my example, I assume it is a tab:
import pandas as pd
# open the files as pandas DataFrames (filepath1..filepath3 stand for your own paths)
file1 = pd.read_csv(filepath1, sep='\t')
file2 = pd.read_csv(filepath2, sep='\t')
# merge on the 'Gene' column using 'inner', so the result is
# the intersection of both datasets
file3 = pd.merge(file1, file2, how='inner', on=['Gene'], suffixes=['1', '2'])
file3.to_csv(filepath3, sep=',')
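Since the sample database repeats the same gene on several rows, a plain inner merge would return one output row per duplicate. The following is only a minimal sketch of one way to handle that, assuming tab-separated input and pandas being installed; the in-memory sample text and the helper name `annotate` are made up for illustration and are not part of the original files.

```python
import io

import pandas as pd

# Hypothetical stand-ins for file1.csv and file2.csv; tab separation is an assumption.
FILE1_TEXT = "Gene\nAQP7\nPDCD6\nCDC2\n"
FILE2_TEXT = (
    "Gene\tfunction\n"
    "PDCD6\tProgrammed cell death protein 6\n"
    "CDC2\tCell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
    "CDC2\tCell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n"
)


def annotate(candidates_text, database_text):
    """Return the candidate genes that appear in the database, with their function."""
    candidates = pd.read_csv(io.StringIO(candidates_text), sep="\t")
    database = pd.read_csv(io.StringIO(database_text), sep="\t")
    # Keep one row per gene so repeated database entries do not repeat in the result.
    database = database.drop_duplicates(subset="Gene")
    return pd.merge(candidates, database, how="inner", on="Gene")


result = annotate(FILE1_TEXT, FILE2_TEXT)
# index=False leaves out the row-number column, so only Gene and function are written.
print(result.to_csv(index=False))
```

With real files, the two `io.StringIO(...)` arguments would simply be replaced by the file paths.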
With basic Python, you can try the following:
import re
gene_function = {}
with open('file2.csv', 'r') as infile:
    # skip the header line
    lines = [line.strip() for line in infile.readlines()[1:]]
for line in lines:
    # the first word is the gene, the rest of the line is its function
    match = re.search(r"(\w+)\s+(.*)", line)
    if match is None:
        continue  # blank or malformed line
    gene = match.group(1)
    function = match.group(2)
    # keep only the first function seen for each gene
    if gene not in gene_function:
        gene_function[gene] = function
with open('file1.csv', 'r') as infile:
    genes = [i.strip() for i in infile.readlines()[1:]]
for gene in genes:
    if gene in gene_function:
        print("{}, {}".format(gene, gene_function[gene]))
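The same dictionary lookup can also write the desired "Gene,function" output instead of printing it. The sketch below is one possible variant using only the standard library, assuming the gene name is the first whitespace-delimited token on each database line; the function names `load_functions` and `write_matches` and the sample lines are hypothetical.

```python
import csv
import io


def load_functions(lines):
    """Map each gene to the first function listed for it; the header line is skipped."""
    functions = {}
    for line in list(lines)[1:]:
        # Split on the first run of spaces or tabs: gene name, then the rest of the line.
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            functions.setdefault(parts[0], parts[1])
    return functions


def write_matches(candidate_lines, functions, out):
    """Write one 'Gene,function' row per candidate found; return how many were written."""
    writer = csv.writer(out)
    writer.writerow(["Gene", "function"])
    count = 0
    for line in list(candidate_lines)[1:]:
        gene = line.strip()
        if gene in functions:
            writer.writerow([gene, functions[gene]])
            count += 1
    return count


# Made-up sample lines standing in for file2.csv and file1.csv.
database = [
    "Gene function\n",
    "PDCD6 Programmed cell death protein 6\n",
    "CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n",
    "CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a\n",
]
candidates = ["Gene\n", "AQP7\n", "CDC2\n"]

buffer = io.StringIO()
matched = write_matches(candidates, load_functions(database), buffer)
print(buffer.getvalue())
```

For real files, open file objects can be passed in place of the lists, and the output file is best opened with `newline=''` so the csv module controls line endings. Because the function text itself contains commas, the csv writer quotes that field.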