Cut a string at a specific pattern in Python
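I have strings of some length that contain only the four characters A, T, G and C. The pattern 'GAATTC' occurs multiple times in a given string, and I need to cut the string at every position where this pattern occurs.
For example, for the string 'ATCGAATTCATA' I should get the output:
- String one -
ATCGA
- String two -
ATTCATA
I am new to Python, but I came up with the following (incomplete) code: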
seq = seq.upper()
str1 = "GAATTC"
seqlen = len(seq)
seq = list(seq)
for i in range(0,seqlen-1):
    site = seq.find(str1)
    print(site[0:(i+2)])
Any help would be appreciated.
Here is a simple solution:
seq = 'ATCGAATTCATA'
seq_split = seq.upper().split('GAATTC')
last = len(seq_split) - 1
result = [
    # every piece except the first regains 'ATTC'; every piece except the last regains 'GA'
    ('ATTC' if i > 0 else '') + piece + ('GA' if i < last else '')
    for i, piece in enumerate(seq_split)
]
Result:
print(result)
['ATCGA', 'ATTCATA']
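My code is a bit sloppy, but when you want to loop over multiple occurrences of the pattern, you can try something like this:
def split_strings(seq):
    # cut two characters into the first occurrence of the pattern (GA | ATTC)
    string1 = seq[:seq.find(str1) + 2]
    string2 = seq[seq.find(str1) + 2:]
    return string1, string2

test = 'ATCGAATTCATA'.upper()
str1 = 'GAATTC'
seq = test
while str1 in seq:
    string1, seq = split_strings(seq)
    print(string1)
print(seq)  # whatever remains after the last cut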
BioPython has a restriction enzyme package that does exactly what you want.
from Bio.Seq import Seq
from Bio.Restriction import EcoRI
print(EcoRI.site) # You will see that this is the enzyme you listed above
test = 'ATCGAATTCATA'.upper() # This is the sequence you want to search
my_seq = Seq(test) # Create a Biopython Seq object (Bio.Alphabet was removed in Biopython 1.78)
cut_sites = EcoRI.search(my_seq)
cut_sites contains a list of the exact cut positions in the input sequence (such that GA is in the left sequence and ATTC is in the right sequence).
You can then split the sequence into contigs using:
cut_sites = [0] + cut_sites # We add a leading zero so this works for the first
# contig. This might not always be needed.
contigs = [test[i:j] for i,j in zip(cut_sites, cut_sites[1:]+[None])]
You can see this page for more details about BioPython.
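The slicing step above does not depend on BioPython itself. As a minimal sketch, assuming the cut positions are already available as a sorted list of slice indices (the helper name fragments_from_cuts is made up for illustration and is not part of BioPython):

```python
def fragments_from_cuts(seq, cut_positions):
    """Slice seq at each index in cut_positions (assumed sorted, ascending)."""
    # Prepend 0 and append None so the first and last fragments are included.
    bounds = [0] + list(cut_positions) + [None]
    return [seq[i:j] for i, j in zip(bounds, bounds[1:])]


print(fragments_from_cuts('ATCGAATTCATA', [5]))  # ['ATCGA', 'ATTCATA']
```

Building a new list instead of modifying cut_sites in place keeps the original positions available for later use.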
Here is a solution that uses the regular expression module:
import re

seq = 'ATCGAATTCATA'
restriction_site = re.compile('GAATTC')
subseq_start = 0
for match in restriction_site.finditer(seq):
    # cut two characters into the match, between GA and ATTC
    print(seq[subseq_start:match.start() + 2])
    subseq_start = match.start() + 2
print(seq[subseq_start:])
Output:
ATCGA
ATTCATA
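The same cut can probably be written more compactly with zero-width lookarounds. This is only a sketch, and it assumes Python 3.7 or later, where re.split accepts patterns that match an empty string; the names CUT_POINT and cut_with_lookaround are illustrative:

```python
import re

# Zero-width pattern: the position preceded by 'GA' and followed by 'ATTC'.
CUT_POINT = re.compile(r'(?<=GA)(?=ATTC)')


def cut_with_lookaround(seq):
    """Split seq between GA and ATTC wherever GAATTC occurs."""
    return CUT_POINT.split(seq.upper())


print(cut_with_lookaround('ATCGAATTCATA'))  # ['ATCGA', 'ATTCATA']
```

Because the pattern consumes no characters, nothing has to be glued back onto the fragments after splitting.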
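First, let's develop your idea of using find, so that you can see where your attempt went wrong.
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GAATTC"
split_at = 2
seqlen = len(seq)
i = 0
while i < seqlen:
    # search for the next site, starting from the end of the previous fragment
    site = seq.find(pattern, i)
    if site != -1:
        print(seq[i: site + split_at])
        i = site + split_at
    else:
        # no more sites: print the rest of the sequence and stop
        print(seq[i:])
        break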
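If the fragments are needed as values rather than printed, the loop above could be wrapped in a generator. The following is a sketch under the assumption that the cut offset lies inside the pattern; the function name iter_fragments and its parameters are invented for this example:

```python
def iter_fragments(seq, pattern='GAATTC', split_at=2):
    """Yield the pieces of seq, cutting split_at characters into each match."""
    if not 1 <= split_at <= len(pattern):
        # An offset of 0 would make find() return the same site forever.
        raise ValueError('split_at must be between 1 and len(pattern)')
    start = 0
    while True:
        site = seq.find(pattern, start)
        if site == -1:
            yield seq[start:]  # whatever remains after the last cut
            return
        cut = site + split_at
        yield seq[start:cut]
        start = cut


print(list(iter_fragments('ATCGAATTCATAATCGAATTCATA')))
# ['ATCGA', 'ATTCATAATCGA', 'ATTCATA']
```

A generator keeps memory use flat for long sequences, and list(...) still gives the full result when that is more convenient.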
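However, Python strings have a powerful replace method that substitutes one substring for another directly. The snippet below uses replace to insert a delimiter wherever a cut is needed:
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GA", "ATTC"
pattern1 = ''.join(pattern) # 'GAATTC'
pattern2 = ' '.join(pattern) # 'GA ATTC'
split_seq = seq.replace(pattern1, pattern2) # 'ATCGA ATTCATAATCGA ATTCATAATCGA ATTCATA'
print(split_seq.split())
I think this is more intuitive than a regular expression and should be faster (regular expressions can perform worse, depending on the library and how they are used).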
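The speed claim is easy to check on your own machine. Here is a rough sketch using the standard timeit module; the function names, the test sequence and the repetition count are arbitrary choices for illustration, and the timings will vary by machine and Python version:

```python
import re
import timeit

PATTERN = re.compile('GAATTC')


def cut_with_replace(seq):
    # Assumes seq contains no whitespace, since a space is used as the delimiter.
    return seq.replace('GAATTC', 'GA ATTC').split()


def cut_with_regex(seq):
    # Same delimiter trick, but the substitution is done by the re module.
    return PATTERN.sub('GA ATTC', seq).split()


seq = 'ATCGAATTCATA' * 1000
# Both approaches should produce identical fragments before being compared.
assert cut_with_replace(seq) == cut_with_regex(seq)

for func in (cut_with_replace, cut_with_regex):
    seconds = timeit.timeit(lambda: func(seq), number=200)
    print(f'{func.__name__}: {seconds:.4f} s for 200 runs')
```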