Search for specific repeats using Python
Example input file:
1 AAcgGGGGGGtacctgt yes
2 TTcccccctgtAAcgta no
3 tcgAAAAaatacgacc no
4 AAcgtataatacctgt no
...
I want to write a program that scans each sequence and checks it for mononucleotide repeats (MNR).
Example output:
1,AAcgGGGGGGtacctgt,yes
2,TTcccccctgtAAcgta,no
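Definition: a mononucleotide repeat is a repeat of A, T, C, or G (case-insensitive).
What I am looking for is a consecutive run like this:
AAAAaaAAgtc
or
gtAAAAAAAAAAc
or
aaaaaaAAA
or
aaaaaaaaaa
or
ccccccccccc
or
CCCCCcccCCC
or...
I tried this regular expression, but it does not work: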
import csv
import re
list=[]
with open('sequences.txt', 'r') as f:
    reader = csv.reader(f,delimiter="\t")
    seq=re.findall(r'[Aa]{6, }','sequences.txt')
    for line in reader:
        if line.__contains__(seq):
            print(list.append(line))
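Thanks for any help.
Update: a partial solution using regular expressions has already been proposed. Note that the solution below does not use regular expressions; instead it finds any run of six or more of the same character, whatever that character is.
Test data: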
number,sequence,status
1,kjhfklashfkldflkhasdfl,0
2,aaaaaljgkldfkjgldkfjgfldj,0
3,bbbbbbjigdfsjgjg,0
4,ccCccCCcjjfijsdfjsdf,0
5,klsjdflsjdfhdddddjnjlkhngjk,0
6,kjkljfhnlasjkdfheeeeeeejjjeeeeeeeeeekjdkljfleeef,0
7,jhfshffFffFFFFffkljjjj908u89,0
Code to find MNRs of length 6 or more:
import csv

def contains_mnr(sequence):
    start_char = "$"  # choose a character that is sure not to be in the sequence
    count = 0
    seq_lower = sequence.lower()
    for pos in range(0, len(seq_lower)):
        if seq_lower[pos] == start_char:
            count += 1
        else:
            start_char = seq_lower[pos]
            count = 1
        if count >= 6:
            return True
    return False

with open("input.csv", "r") as input_file:
    with open("output.csv", "w") as output_file:
        reader = csv.DictReader(input_file, dialect=csv.unix_dialect())
        writer = csv.writer(output_file, dialect=csv.unix_dialect())
        writer.writerow(reader.fieldnames)
        for row in reader:
            if contains_mnr(row["sequence"]):
                writer.writerow([
                    row["number"],
                    row["sequence"],
                    row["status"]
                ])
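As a side note, the character-by-character loop can also be expressed with `itertools.groupby`, which groups consecutive equal characters. The following is only a sketch of that alternative, not part of the answer's code; the names `longest_run` and `has_long_run` and the `min_len` parameter are made up for illustration.

```python
from itertools import groupby


def longest_run(sequence):
    """Return the length of the longest run of one repeated character (case-insensitive)."""
    # groupby yields one group per stretch of consecutive equal characters
    run_lengths = (sum(1 for _ in group) for _, group in groupby(sequence.lower()))
    return max(run_lengths, default=0)


def has_long_run(sequence, min_len=6):
    """True if any single character repeats at least min_len times in a row."""
    return longest_run(sequence) >= min_len


# Two sequences taken from the test data
for seq in ["aaaaaljgkldfkjgldkfjgfldj", "ccCccCCcjjfijsdfjsdf"]:
    print(seq, longest_run(seq), has_long_run(seq))
```

Returning the run length rather than only a boolean makes it easy to change the threshold later, at the cost of always scanning the whole sequence instead of stopping at the first long run.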
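Note that the CSV dialect may need to be adjusted to suit the system the code runs on and the one that produced the data file.
Output written to output.csv for the test data given above: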
"number","sequence","status"
"3","bbbbbbjigdfsjgjg","0"
"4","ccCccCCcjjfijsdfjsdf","0"
"6","kjkljfhnlasjkdfheeeeeeejjjeeeeeeeeeekjdkljfleeef","0"
"7","jhfshffFffFFFFffkljjjj908u89","0"
Here is a compact solution that does what you want:
import csv
with open('sequences.txt', 'r') as f:
    reader = csv.reader(f, delimiter=",")
    for line in reader:
        seq_lower = line[1].lower()
        if 'aaaaaa' in seq_lower or 'cccccc' in seq_lower or 'tttttt' in seq_lower or 'gggggg' in seq_lower:
            print(line)
Here I assume that, since you are dealing with DNA sequences, you only care about MNRs of a, c, g, and t.
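For completeness, the four substring tests could be folded into a single regular expression with a backreference. This is a sketch under assumptions, not a definitive implementation: the input is taken to be tab-separated as in the question's own attempt, the sample rows are inlined through `io.StringIO` instead of reading `sequences.txt`, and the names `MNR_PATTERN`, `has_mnr`, and `filter_rows` are hypothetical. It also avoids two problems in the question's attempt: Python's `re` does not parse `{6, }` (with a space) as a quantifier and matches it as literal text, and `re.findall` was applied to the file name string rather than to each sequence.

```python
import csv
import io
import re

# One nucleotide, then the same nucleotide at least 5 more times (6+ in total).
# With IGNORECASE the backreference \1 also matches the other letter case.
MNR_PATTERN = re.compile(r"([acgt])\1{5,}", re.IGNORECASE)


def has_mnr(sequence):
    """True if the sequence contains a run of 6 or more identical nucleotides."""
    return MNR_PATTERN.search(sequence) is not None


def filter_rows(in_file, out_file):
    """Copy rows whose second column contains an MNR, written comma-separated."""
    reader = csv.reader(in_file, delimiter="\t")  # assumption: tab-separated input
    writer = csv.writer(out_file, lineterminator="\n")
    for row in reader:
        if len(row) >= 2 and has_mnr(row[1]):
            writer.writerow(row)


# Rows from the question, inlined so the sketch runs without a file on disk
sample = io.StringIO(
    "1\tAAcgGGGGGGtacctgt\tyes\n"
    "2\tTTcccccctgtAAcgta\tno\n"
    "4\tAAcgtataatacctgt\tno\n"
)
result = io.StringIO()
filter_rows(sample, result)
print(result.getvalue(), end="")
```

To process a real file, the two `StringIO` objects would be replaced by files opened with `open(..., newline="")`, which is what the `csv` module documentation recommends.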