Generate all permutations and combinations and replace them in a string
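I have a DataFrame column containing strings such as:
'TCCTGTAAATCAAAGGCCAAGRG', 'GNGCNCCNGAYATRGCNTTYCC', 'GATTTCTCTYCCTGTTCTTGCA'
I also have a list of letters and their replacements:
SNPs = {}
SNPs["Y"] = ['C', 'T']
SNPs["R"] = ['A', 'G']
SNPs["N"] = ['C', 'G', 'A', 'T']
Every R needs to be changed to A or G, and likewise for the other letters.
For example, TCCTGTAAATCAAAGGCCAAGRG becomes TCCTGTAAATCAAAGGCCAAGAG and TCCTGTAAATCAAAGGCCAAGGG.
I want all permutations and combinations, with the results in another column.
Please help me.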
import re, itertools
text = "GNGCNCCNGAYATRGCNTTYCC"
def getList(dict):
    return list(dict.keys())
lsources = getList(SNPs)
ldests = []
for source in lsources:
    ldests.append(SNPs[source])
#print(ldests)
# Generate the various pairings
for lproduct in itertools.product(*ldests):
    #print(lproduct)
    for i in text:
        output = i
        for src, dest in zip(lsources, lproduct):
            # Replace each term (you could optimise this using a single re.sub)
            output = output.replace("%s" % src, dest)
        print(output)
This is my code, but I am not getting the desired output.
Try this:
>>> import itertools
>>> text = "GNGCNCCNGAYATRGCNTTYCC"
>>> SNPs={ "Y" : ['C', 'T'] , "R" : ['A', 'G'] , "N" : ['C', 'G', 'A', 'T']}
>>> text_tmp = ""
>>> dct = {}
>>> for idx, v in enumerate(text):
...     if v in SNPs:
...         dct[idx] = SNPs.get(v)
...         text_tmp += f'_{idx}_'
...     else:
...         text_tmp += v
>>> text_tmp
'G_1_GC_4_CC_7_GA_10_AT_13_GC_16_TT_19_CC'
>>> dct
{1: ['C', 'G', 'A', 'T'],
4: ['C', 'G', 'A', 'T'],
7: ['C', 'G', 'A', 'T'],
10: ['C', 'T'],
13: ['A', 'G'],
16: ['C', 'G', 'A', 'T'],
19: ['C', 'T']}
>>> per_val = list(itertools.product(*dct.values()))
>>> per_key_val = list(map(dict,[zip(dct.keys(), p) for p in per_val]))
>>> per_key_val
[{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'C', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'C', 19: 'T'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'G', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'G', 19: 'T'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'A', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'A', 19: 'T'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'T', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'A', 16: 'T', 19: 'T'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'C', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'C', 19: 'T'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'G', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'G', 19: 'T'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'A', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'A', 19: 'T'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'T', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'C', 13: 'G', 16: 'T', 19: 'T'},
{1: 'C', 4: 'C', 7: 'C', 10: 'T', 13: 'A', 16: 'C', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'T', 13: 'A', 16: 'C', 19: 'T'},
{1: 'C', 4: 'C', 7: 'C', 10: 'T', 13: 'A', 16: 'G', 19: 'C'},
{1: 'C', 4: 'C', 7: 'C', 10: 'T', 13: 'A', 16: 'G', 19: 'T'},
...
]
>>> out = []
>>> for pkl in per_key_val:
...     tmp = text_tmp
...     for k,v in pkl.items():
...         tmp = tmp.replace(f'_{k}_', v)
...     out.append(tmp)
>>> out
['GCGCCCCCGACATAGCCTTCCC',
'GCGCCCCCGACATAGCCTTTCC',
'GCGCCCCCGACATAGCGTTCCC',
'GCGCCCCCGACATAGCGTTTCC',
'GCGCCCCCGACATAGCATTCCC',
'GCGCCCCCGACATAGCATTTCC',
'GCGCCCCCGACATAGCTTTCCC',
'GCGCCCCCGACATAGCTTTTCC',
'GCGCCCCCGACATGGCCTTCCC',
'GCGCCCCCGACATGGCCTTTCC',
'GCGCCCCCGACATGGCGTTCCC',
'GCGCCCCCGACATGGCGTTTCC',
'GCGCCCCCGACATGGCATTCCC',
'GCGCCCCCGACATGGCATTTCC',
'GCGCCCCCGACATGGCTTTCCC',
'GCGCCCCCGACATGGCTTTTCC',
'GCGCCCCCGATATAGCCTTCCC',
'GCGCCCCCGATATAGCCTTTCC',
'GCGCCCCCGATATAGCGTTCCC',
'GCGCCCCCGATATAGCGTTTCC',
'GCGCCCCCGATATAGCATTCCC',
'GCGCCCCCGATATAGCATTTCC',
'GCGCCCCCGATATAGCTTTCCC',
...
]
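The same result can probably be reached without the placeholder string. The following is only a sketch, not the approach used above: it gives every position its own list of choices (the alternatives for an ambiguity letter, or the letter itself) and lets `itertools.product` combine them. The function name `expand_ambiguous` is made up for illustration.

```python
import itertools

# Same lookup table as in the question.
SNPs = {"Y": ["C", "T"], "R": ["A", "G"], "N": ["C", "G", "A", "T"]}

def expand_ambiguous(text, table=SNPs):
    """Return every string obtained by resolving each ambiguity letter."""
    # Ordinary letters get a one-element list, so product() leaves them alone.
    choices = [table.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in itertools.product(*choices)]

print(expand_ambiguous("TCCTGTAAATCAAAGGCCAAGRG"))
# ['TCCTGTAAATCAAAGGCCAAGAG', 'TCCTGTAAATCAAAGGCCAAGGG']
```

Because the positions are visited left to right, the results should come out in the same order as in the `out` list above. Note that the number of results is the product of the alternatives at each ambiguous position, so a string with many N letters grows quickly.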
Update (running on a DataFrame):
import itertools
import pandas as pd

def rplc_per(text):
    SNPs = {"Y": ['C', 'T'], "R": ['A', 'G'], "N": ['C', 'G', 'A', 'T']}
    text_tmp = ""
    dct = {}
    # Replace each ambiguity letter with a placeholder that records its position
    for idx, v in enumerate(text):
        if v in SNPs:
            dct[idx] = SNPs.get(v)
            text_tmp += f'_{idx}_'
        else:
            text_tmp += v
    # Every combination of alternatives, keyed by position
    per_val = list(itertools.product(*dct.values()))
    per_key_val = list(map(dict, [zip(dct.keys(), p) for p in per_val]))
    out = []
    # Fill the placeholders once per combination
    for pkl in per_key_val:
        tmp = text_tmp
        for k, v in pkl.items():
            tmp = tmp.replace(f'_{k}_', v)
        out.append(tmp)
    return out

df = pd.DataFrame({'String': ['TCCTGTAAATCAAAGGCCAAGRG', 'GNGCNCCNGAYATRGCNTTYCC', 'GATTTCTCTYCCTGTTCTTGCA']})
df['all_per'] = df['String'].apply(rplc_per)
print(df)
Output:
                    String                                            all_per
0  TCCTGTAAATCAAAGGCCAAGRG  [TCCTGTAAATCAAAGGCCAAGAG, TCCTGTAAATCAAAGGCCAA...
1   GNGCNCCNGAYATRGCNTTYCC  [GCGCCCCCGACATAGCCTTCCC, GCGCCCCCGACATAGCCTTTC...
2   GATTTCTCTYCCTGTTCTTGCA   [GATTTCTCTCCCTGTTCTTGCA, GATTTCTCTTCCTGTTCTTGCA]
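If one row per expanded string is easier to work with than a list in each cell, `DataFrame.explode` may help. This is a sketch under the assumption that a long table is acceptable; the helper `expand_ambiguous` and the name `long_df` are made up for illustration and are not part of the answer above.

```python
import itertools
import pandas as pd

# Same lookup table as in the question.
SNPs = {"Y": ["C", "T"], "R": ["A", "G"], "N": ["C", "G", "A", "T"]}

def expand_ambiguous(text):
    """Return every string obtained by resolving each ambiguity letter."""
    choices = [SNPs.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in itertools.product(*choices)]

df = pd.DataFrame({"String": ["TCCTGTAAATCAAAGGCCAAGRG",
                              "GNGCNCCNGAYATRGCNTTYCC",
                              "GATTTCTCTYCCTGTTCTTGCA"]})
df["all_per"] = df["String"].apply(expand_ambiguous)

# One row per expanded string; the original string is repeated next to it.
long_df = df.explode("all_per").reset_index(drop=True)
print(long_df.head())
```

With the three sample strings this should give 2 + 2048 + 2 rows, since the second string has seven ambiguous positions.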