Python: DNA Sequence to ASCII Text

My goal is to find a piece of text hidden as 8-bit ASCII inside a very long (>115,000 characters) DNA sequence.

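I've written code that opens the file containing the DNA and converts every C and A to 0 and every T and G to 1. I then convert the resulting string to ASCII characters. My code is below.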

with open("DNAseq.txt") as mydnaseq:
    sequence = mydnaseq.read().replace('\n','')

DNAa = sequence.replace('A','0').replace('C','0').replace('G','1').replace('T','1')
DNAb = ''.join(DNAa)

DNAc = [DNAb[i:i+8] for i in range(0, len(DNAb), 8)]

DNAd = []
for i in DNAc:
    j = int(i,2)
    DNAd.append(j)


DNA1 = []
for i in DNAd:
    if i >= 32 and i <=127:
        DNA1.append(i)

text = []
for i in DNAd:
    j = chr(i)
    text.append(j)

Answer = open("textanswer.txt", 'w')
Answer.writelines(text)
Answer.close()

But I get this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\x9e' in position 0: character maps to <undefined>

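and I have no idea what could be causing it. My DNA sequence is apparently a mix of random characters, with only a fragment of a play/poem hidden in it.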

I tested my code with testDNA.txt, which contains the following:

ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG

This returns (as expected):

Steak Bake

Can anyone explain why I get this error with my DNA sequence?

I think you want to use the built-in chr() function.

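Here is a short example that uses str.translate to convert the letters to digit characters, and then converts each 8-character substring to its ASCII equivalent.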

>>> s = "ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCTTAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG"
>>> trans_dict = {"A":"0", "C":"0", "G":"1", "T":"1"}
>>> trans_table = str.maketrans(trans_dict)
>>> s.translate(trans_table)
'01010011011101000110010101100001011010110010000001000010011000010110101101100101'
>>> t = s.translate(trans_table)
>>> [t[i:i+8] for i in range(0, len(t), 8)]
['01010011', '01110100', '01100101', '01100001', '01101011', '00100000', '01000010', '01100001', '01101011', '01100101']
>>> [chr(int(t[i:i+8],2)) for i in range(0, len(t), 8)]
['S', 't', 'e', 'a', 'k', ' ', 'B', 'a', 'k', 'e']

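As I mentioned in the comments, DNAd contains numbers outside the valid ASCII range. But you already filter those out when you create DNA1, so you should loop over DNA1, not DNAd, to build text.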
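To make the failure concrete, here is a minimal sketch. The `codes` list is made-up sample data standing in for DNAd, and I'm assuming the 'charmap' codec in your traceback is cp1252, the usual default for text files on Western Windows setups; your machine may use a different code page.

```python
# Hypothetical sample standing in for DNAd; 0x9e (158) is outside ASCII
codes = [0x9e, 83, 116, 101, 97, 107]

# Building text from every code vs. only the printable ASCII ones (32..126)
unfiltered = ''.join(chr(c) for c in codes)
filtered = ''.join(chr(c) for c in codes if 32 <= c < 127)

# Writing to a text file encodes the string; cp1252 (assumed here)
# has no encoding for U+009E, so the unfiltered text fails
try:
    unfiltered.encode('cp1252')
except UnicodeEncodeError as err:
    print('unfiltered text fails:', err.reason)

# The filtered text is pure ASCII and encodes without trouble
print(filtered.encode('cp1252'))
```

Passing an explicit encoding to open() when you write the file, for example encoding='utf-8', also stops the result from depending on the platform default.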

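However, in Python 3 there's no need to call the chr function on every ASCII code number. You can simply pass the list (or any other iterable) to the bytes constructor, which builds a bytes string that you can then decode to Unicode text.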
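As a small illustration, using the ten code numbers that the test sequence from the question produces (the variable names are just for this sketch):

```python
# ASCII codes decoded from the short test sequence in the question
codes = [83, 116, 101, 97, 107, 32, 66, 97, 107, 101]

# bytes() accepts any iterable of ints in range(256)
raw = bytes(codes)

# Decode the bytes object to a str in one step
text = raw.decode('ascii')
print(text)  # Steak Bake

# Bytes >= 128 are not ASCII; errors='ignore' drops them while decoding
print(bytes([0x9e, 83]).decode('ascii', errors='ignore'))  # S
```

Note that decoding with 'ascii' and no errors argument raises UnicodeDecodeError on any byte of 128 or more, so the junk bytes need to be filtered out or ignored first.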

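Also, instead of using the str.replace method to convert the DNA letters to '0' and '1' characters, we can use str.translate, which is more efficient when you need to map single characters to other single characters; str.translate can also delete unwanted characters. In the code below I use it to delete spaces and newlines. I also delete the Unicode Byte Order Mark at the start of your 'DNAseq.txt' file.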

First, here's a demo using the short DNA sequence given in the question.

# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))

dna = '''\
ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG
'''

print(dna_to_bytes(dna).decode('ascii'))

Output

Steak Bake

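To find the message hidden in your DNAseq.txt file we need to ignore bytes outside the valid ASCII range, as your code does. However, we also need to skip a few bits before we start converting the 8-bit blocks to bytes. There are only 8 possible offsets, and since there isn't much data it's easy to find the correct offset of 2 by trial and error. OTOH, it did take me a little while to think of trying an offset. ;) If we were dealing with millions of bytes we might need to resort to statistical analysis to find blocks of bytes that are likely to be valid English.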
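The trial and error can be automated. The following is only a rough sketch, not what I actually ran: the `LIKELY` set, `longest_run` and `rank_offsets` are names invented here, and "longest stretch of letters and spaces" is a crude stand-in for a proper statistical test. The input is synthetic: two junk bits followed by a line from the play.

```python
import string

# Same mapping as above: A, C -> 0 and G, T -> 1; whitespace and BOM deleted
TBL = str.maketrans('ACGT', '0011', '\n \ufeff')

# Crude notion of "looks like English" (an assumption): ASCII letters and space
LIKELY = frozenset((string.ascii_letters + ' ').encode('ascii'))

def dna_to_bytes(dna, offset=0):
    bits = dna.translate(TBL)
    # Unlike the version above, stop early so a trailing partial group is dropped
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits) - 7, 8))

def longest_run(data):
    # Length of the longest stretch of consecutive "likely" bytes
    best = cur = 0
    for b in data:
        cur = cur + 1 if b in LIKELY else 0
        best = max(best, cur)
    return best

def rank_offsets(dna):
    # Offsets 0..7, most promising candidate first
    return sorted(range(8), key=lambda k: longest_run(dna_to_bytes(dna, k)),
                  reverse=True)

# Synthetic input: two junk bits, then an ASCII phrase, written as DNA letters
msg = 'Men of great worth resorted to this forest'
bits = '10' + ''.join(format(b, '08b') for b in msg.encode('ascii'))
dna = bits.translate(str.maketrans('01', 'AG'))

best = rank_offsets(dna)[0]
print(best, dna_to_bytes(dna, best))
```

On real data the junk will occasionally form short runs of letters at the wrong offsets, so it is worth looking at the top two or three candidates, not only the first.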

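The program below doesn't attempt to isolate the hidden message; it's easy to spot in the middle of the garbage text. Note that the first line of the message is hidden at the end of the long line of garbage that precedes it.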

# ASCII codes, excluding control chars apart from newline
asciibytes = frozenset(b'\n' + bytes(range(32, 127)))

# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))

fname = 'DNAseq.txt'
with open(fname) as f:
    dna = f.read()

b = dna_to_bytes(dna, offset=2)
a = bytes(u for u in b if u in asciibytes)
print(a.decode('ascii'))

Output

;J\Zza%_&jHs F0kM:!ZsfCq1)^7!Bg%=8:2eMz(|tl KRS@@9$`!2wAD5@>K~_CA"u_R9<
p?+D*WRCH`=LY/v0&Sl[l|"x1h-_GT!P'36'PS&&<eY5yakZd?$R!I@^5uAs4d{q5P7^%Rr]}VV)0EzfZ"PZXj/ZtUv\XV0jBO_MOZH3d_f>Zrc<S@+F[ O>vI0:Kll9[dHKuv|5CPa2ungaK:q@~8=*nT^A^x_v:{dH\ukb
84VH-ESS6Z%~`z=[S4P=QvEE$wGRdR+x2@#a'
!&:!Ei:ttE;C9MWp:sF
)91J"7c@,2@{0$c,6R0=p.RJawE*U+}}Vo^2Dhf-PAn@O1yPIH~4J9e6H %,3>)@:K(N_o4\`'`;yQ$
?5t'^@W*YlaEI(@CT*H^u.1 czQ*
H`SzD)4W"[JEnI0E`N 3[gAP`Ve_mBE\v!932E&V4sw~*RurKPq2;B*BwF6c-'fJ~<=25=EAea\Qu!:NW:@d'"ZB?q 0D9FrbGm*PLR*^QwCg>,a,U'_-&P!#;h.f3E!jt]
BOGnmt0*#
g'zkeF;g"kBU(/`I1dxO`+0Q=6bqxI_Y\k#?'r'2nfJ"R$<eaw,(<LIUQxMPqsb}Us/ga?/UY3N#<DWh*$ry#BhtOL'+&c.CZ]BpRM1]bEVfhw2aaNGyR4r,V[Bx=`fd+%@eiH-bXv2lYM8gj958PK"XSWT?w_`E;.-`yxxXmIt+THhC4CVT%9-+T;BX0H
9wTnr (\KibvKI:OZUQ <x*"`_9.nc" W"x>A0?4D%=fHpa cvai;+a3*@2<@u!x|R0QQJ8|\`jrFPJH!$v=?bXe54[9oTBno
*ly[1EbHPh/Lh8c9*YQ0BR9NI,-q$IR~]$g#%'[,y.8He%e@Pg 9\v(:31wt9>VcP<Dl37`|yIU>nI"ZJ5Q4_}gNzK$.h;d[=13=]$HI)ixAI3lahaIc@$*Q3/RJfI1"c%Mq^eo9AsPan 'TZPbdFDuBG,^0t[3Nuf@ C%6%k+RxR IYqArp6L"vDxE&Q#FdN\,UNy_)d;Ap}AI6ZW7f/L/@RiTg1or*+^'{ >$I@~2jp<ph/LB*XRh#_7Y^*d.fJ[#Odx."v&IYU%:HB4;(iMh[H jAYci5I){_}1A64{/'CRsYWdkP[!h$s"-KmsM+eLa$||N\#H"NYS.[_#+r4?m7*AredM!_%/;tFP#M4hh?kA)Z%zJ3-x]KK.FcAYOHO+dzLD'w|:,>?qG4mU&T+ABFXV@Wa&ER;0zEj.Qi?<tff(*Y)M~rRgWxd^dnlm{ATYy;^a'
[elI[nu/}42#kI$+3w"8pehY7`A<NV5V(J\?z=R-(;*d&\-c?OJ,zcs?`l6QZ5`U2U%m"F&!0 WBOVqeY5*^@j'j(S.a3{1C9&'W,
vo*a!U1]UQcib>%QlI]|B$U/zzQd)_$b f [d_";JgQ P**IFXQ& %* Xa88%T
?er*hM|dq@]5s_5H"#IeTeQ5BR 'vq[E\e&A1ykv4a$~`*hW4tJ.cIwb('rG]y){xxH|Jdc@~-.[{1kAJ VWzVGd&c?<-%Jt>e55eh^LX<%G f,Byg'<#[@.+a (oW*KrSRM`S18#1V\!jC^SW,v1Sc-?s~pcrsaBX``dg1JmzWO^7iw8AAK$^1&7F[W*cSVCuq5iqYayWUpfQG~^B88!gRR!O 
-n"Gq
Rzfn.`w\.3)aNw2\^)ELn%KKDoiF)$b?$>H?/eNR=DglRLi49Do\ Tx%@5KK>(jU(D;)iQjC0>T:;J[sxCc`|y+5BnxQ.h8#/@%*1zAVHvFug"Aqe7wG^!D!10-N^Mp) #N'kto)tyXl0W4u[!Hb&dpqFu7P#:Ui\kzVD~ AgV]*Q%X&i#'2yr_TvaGU4PpOVT*x!W4b(py4acV3XId^lIR%b=-
:~EuBmT&$P|W0Ae.lZ"%NlGf/M R)eY,iaJo"
^RT9IBG<xH!I_B EC2@0Oy*";>JA+jyTBx;#Qq5"G7)D0HPEFI6D/#:Nc-DrSVJEeJ$.}M`8Ic9"dda%(2#"~;C)SAqbHYQ"D#O;qWz}>j#u9X1BD

8lNowODQt\v+K+:ELLoW2w9iz!6uY%*71PNX857Dz(vwtLb<Tj`~243q
Gr1urC46'EcVd%/#z6!Fr9omhk{|!,].YM T<j^m0:"9?r{O/9|.4zZ@Pb#E#)[jY\s|I/<m=GJ'<X..nr*Y4v1<RHe>1{`FoBQFhE"d5(eXW,`#OzeC{AKh?[aL+lz+Hw:&2c^sA!$:e)b
4I6DnkgW^1 +*F^.O_oB]]b&^(bW))Ma HQ1P:tE,[,?_xTnq6c?p0er!GRV=u
o8kcT=aJO+$zqN78,yZT@xiBr!G)URJ_gI:($e J3H._5i# pDy(u*-oI3U|/Iq"szA(d3-2S >!uT{C{{zp86lZ02@K!?qGQIO{dOi%:^+av M
]~$H0GJwl@<oQRCr.
9bYcB>dU:P8A^ 0S4zl!GA/AcYYUw({_5IAUx-&ISqbLKM3\VV
,tTc~cVlqCxc{6?v9wN6"rZ+
(E%r
%I{G2JVp6_:OG4T&7, /y_$w_^XG+:|0v/;0oHxeaBao*<1>ChA4W0j|v^Il5skOFD2vT.>9`N3M'S<fgI,-_h,;oEINwu<~;{nK(rQ9cNLC=jXFMq88PxPFy:K^hD~*#tvsDCM :|~@p\JB=)2#i*Jd2{!2|h?9U=__RxQo"[<6y-R+UwBG3Lb3r&H=)2E$GcNm2)JTMU5iV0[Iv(5%'RT<2[zxAH`8kJa>4I)jDMiqC2wT{Xg>!*.8Yf7^{|t@P/KEY4intvq"OR=ch5}k4uqncK
9[;0/A/9;5%t+&|wT
/=FY_$q("/+,cqa
X\DE?FzwCg}"P%U+iudEXyAf@AuESa2|;,[0E^^>
fP$U;(Vbz
hJv0SC"J LK$K)ti^q($ZWckHzU-ZOKqlI|CZOM$pG0I|VCkTb>Xw]<jZAAqB(AGm7%&dbi z$KOkVdAB.
+gy4/w;ZFV|)zY|`U'g8EV7W*4<*dS*%Yl"D,@P#N^Jd:Xwc"
[H_gjl$jAI3{i0wE~2o(n #GVI8
`d$Y,0Gs?7h0`vYmLN)&SG;!(
@:,N6:Ez?8^T7+oawF4KY|oudzBZ!@ke8~p3|d$\U)P^D+f8L;>SxH.tPw
/"CtOmy?m)L*E:[^>A2u\*eW4yGvvAy(.)H=auJ?i_$PLaYb",*W/H3u=:4_"9%J"dF_+{`B=bq~hTm# qiz)iq\"LJ]oll7_2b!*]}5}{^O1o@)UE%dA6ea~O!~ (S7(q>2xu}i8Vf9N)}^n]e} >(_/K,Kmiv)'`2*~z-S3zg^@$eTTn^Y1*jH_N"5M~EtQ4]V&N'1:HP4/e`Y|h.^xLPM:[F`s!E9]m*J'3Zni24}UNQ&'Xg4`P.tS#Lku86o PJTM+:(J&k;]a2<6E=bAgN?_q6*j3_hTRAk7%zH$M)e(#("oIAkH{LH,+"x1RZ hkxF<.9#.r^R<AA%FUS}"ODLL*;r)VS!(N1[y^ZXV6cLL`kBIW]Dd,(&DEi}8f/40pTEDLr7KtNV!piBIgoH].|c#~]Ex$-9P`H Ob%;H|7,kS1>[]6TBR}D1;
x %Y#w.Hh8NzOL,[zOugJ60"R#m@`E YKo>YPc&C]O
O1z7O;R8~
DYw`6kBxdha_l..%]G4Z/j:Ic1BHeW^0.;Hqxq'D 1 RLa1CKR)LVA[lk2,z@D"jl%~N-w)y)=Gc?(y>pE9|QA[?
4,2@$)8kMJ^XmNeBuuN5Y)4ZdV"#6?x7^$)C|a[77H;i5)3xq.Af=n7#8j.>'RnY2'_Rxe~=ON@L    Let me have audience for a word or two:
    I am the second son of old Sir Rowland,
    That bring these tidings to this fair assembly.
    Duke Frederick, hearing how that every day
    Men of great worth resorted to this forest,
    Address'd a mighty power; which were on foot,
    In his own conduct, purposely to take
    His brother here and put him to the sword:
    And to the skirts of this wild wood he came;
    Where meeting with an old religious man,
    After some question with him, was converted
    Both from his enterprise and from the world,
    His crown bequeathing to his banish'd brother,
    And all their lands restored to them again
    That were with him exiled. This to be true,
    I do engage my life.
[b$gdj~S~ma 7&x$aDa2w/N@&}Dx'+- p;^9J]9?!"HKTY&X
!dF5 ab%|=(Z--!<*)T$I<L!$fT`."ZhD~2FP?8M-4{u@1_qJ
nN+m:FvEI>bA
(VVJyAc2U|ixggPwTEXBsW',S>z3=u[C|J)Zbv^&4A;QAE(9%O\ #.z8T=+
L.!ycBr/WBTAWTT Jf|fEt|@&8^E/8DnV~:7S#i<BsV lh/S];@qH{BH.MD`YH~dr((rI#B%\ID
JqPcnffc<-PI+|:7QBy,l5.G'/sU!"B[Mx[VgQo8.J9fz"LlcMSc\OWU^L7]$ u_#Dy85UdPd1 %3yEPRpziAKOu>/9+?@k!v(mRcu}5m2#5_13FUPO^uUhe{$L9.W~1_{([~=DJfU)J/5F>0=eQr0&A\__C
T0A
\Y]a!-:](p]gp_^u\@Iu% 7j@3OaIT5baAuFv,2}+PjcK]Xm9Dfx9"I|JC>=!GwFHY>@`
`%}B.TT2aq#Q"iB R9VYH!R;5wzE2;z-e@dR.5Dr(% IjO&(lG(vPzX SD1$T\SP+Tm4y)k?CQK8VH3`Q%{zd2^iBET}QB1(~YK0|UQ.a5FuHAxc<+XG\w'6 RrJv.pAKHXxS9:N|[1H<`q`w,9|VQ~$W3vJu :19UO%gui2M"]&UpPBbG@nr"+0J16Rh2:w2}vWi<kR%>~_uLINbmtH[:e%Oh5i AxFDH( hzfJ}HzUeBK9Mf5S+QnA2V#E%[0CH;`O(i;ySuHp(?B3H]boY'm,DU$NJ\L4#o>bl|S"%'ovsdP]97.SR-x34uH.{};y<%IYa_Nor2~0+\A<^&c5)2 }QlyNr#2lY$?yx}^N!,Q\G'2z
jx`<M!""P3_6mzFL5')0b=dSfX$D:xSh'AxU$Lr*ff?""/Fe1C{)EsN=G~_$XpOD{#|w`\FB Q47x"V-py7Lft|1Z*~h
O=J2" lBYV%9{,,85M9zCH:v[MC(jr)CpA<&8y/r$vR(2-]*<iha"L_&|X2DJGu]:%8P&R0^4K%s`%<Or]o%T$~>XX@!3)98c$&s3MXQB^+{p<:hB}/CIk\-.}ES=_-=y^~A5<Xe(:2f4FfB)('%4?#N5M,
B@DJ0.('.N$~Haf|)`GxiZ40Xd 4I0C$+tA!i>18;.  %~`G!_&%,#v;K/$x15urOnMdnRY!+` "l;>itE=B]>Q}'_2[W&}49dg/&SRM(]`CR|X>>i*?':}OLrcT-4um\"b%awP V%?{RV$QTP0]4C[WOeG*%&|_"b-@?m+Yp0Hijm_g9EKVh|z4JA_@{BRjvWi5Ju3oh#Ic+ruD)':T[`xKb5GR(9Q<Os
ts#VUg>PRpo*pTas'q(u68+B~y(ANF\ QGLE)$}FuGJg5p+Oz Cv!<dQJ> 4BsiR~8F:}t;Dy%yYIGq9c~QF?R.2_!,Z
Bg
'PV1CZ]Pk];[Y8Y-fCDvLnxBmE+I)J,)zgX(:{UmU}yPeU$!}Ld:ac*F8buf6Ane

FWIW, the secret message is a passage from Shakespeare's As You Like It, Act 5, Scene 4.
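If you'd like the program to pull the message out of the garbage for you, one possible approach is to search the filtered bytes for long runs made up only of letters, spaces, newlines and common punctuation. This is a sketch under my own assumptions: `find_text_runs`, the character class and the minimum length of 40 are all arbitrary choices, and the sample bytes are a short excerpt in the style of the output above.

```python
import re

def find_text_runs(data, min_len=40):
    # Runs of at least min_len bytes drawn from letters, spaces,
    # newlines and a handful of punctuation marks
    pattern = re.compile(rb"[A-Za-z ,.;:'!?\n-]{%d,}" % min_len)
    return [m.group().decode('ascii') for m in pattern.finditer(data)]

# Short made-up excerpt: junk, two lines of the message, more junk
sample = (b"8j.>'RnY2#"
          b"    Let me have audience for a word or two:\n"
          b"    I am the second son of old Sir Rowland,\n"
          b"[b$gdj~S~ma 7&x$")

for run in find_text_runs(sample):
    print(run)
```

Random junk rarely goes 40 bytes without a digit or symbol, so a threshold like that filters out nearly all of it, but a very short hidden message would need a lower value and some manual checking.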