Complex string filtering with Python

I have a very long string that represents a phylogenetic tree, and I want to filter it in a very specific way.

(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;

Basically, every x@y is a piece of species@gene_id information. What I want to do is reduce the string so that I only have x instead of x@y:

(Esy, Aar,(Spa,Cpl))...

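I tried splitting the string first, but the problem is that the string has different 'split points' for what I want to achieve: some x@y parts end with , while others end with ). I searched for a solution and came across regular expression operations, but I am new to Python and not sure whether that is what I should be looking at. I also thought about strip(), but it seems I would have to specify the characters to remove.

The main problem is that there is no 'pattern' I can tell Python to follow. The only regularity is that all species IDs are 3 letters long and come right before the @ character.

Is there any way to achieve this? I would be very glad if you could help me with my problem. Thanks in advance.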

How about a function like this:

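def parse_string(string):
    new_string = ''
    skip = False  # True while we are inside an "@gene_id:branch_length" part
    for char in string:
        if char == '@':
            skip = True   # start dropping characters at the gene ID
        if char == ',':
            skip = False  # a comma starts the next entry
        # parentheses are always kept so that the tree structure survives
        if not skip or char in ['(', ')']:
            new_string += char
    return new_string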

Calling it on your string:

string = '(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
parse_string(string)
> '(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))'
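For very long trees, the same flag-based loop can collect characters in a list and join them once at the end, which avoids repeated string concatenation. This is only a sketch of that idea; the name strip_annotations and the tiny tree are made up for illustration, and it assumes every leaf has the form x@y like in your data:

```python
def strip_annotations(newick):
    """Keep species names and parentheses, drop '@gene_id' and branch lengths."""
    kept = []
    skip = False
    for char in newick:
        if char == '@':
            skip = True    # everything from '@' on belongs to the gene ID / branch length
        elif char == ',':
            skip = False   # the next entry starts after a comma
        if not skip or char in '()':
            kept.append(char)
    return ''.join(kept)


small_tree = "(Aaa@g1:0.1,(Bbb@g2:0.2,Ccc@g3:0.3):0.05):0.0;"
print(strip_annotations(small_tree))  # (Aaa,(Bbb,Ccc))
```

Note that the branch length after a closing parenthesis is only dropped because skip is still set from the last leaf inside that group, so a leaf without an @ would behave differently.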

Give this a try:

import re

pat = re.compile(r'(\w{3})@')
txt = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
pat.findall(txt)

Result:

['Esy', 'Aar', 'Spa', 'Cpl', 'Bst', 'Aly', 'Ath', 'Chi', 'Cru', 'Hco', 'Hlo', 'Hla', 'Hse', 'Esa', 'Aal']

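If you need the full structure, we can try removing the unwanted parts instead:

# drop everything from an '@' or ':' up to the next ')' or ','
pat = re.compile(r'[@:][^),]*')
pat.sub('', txt).replace(',', ', ')

Result: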

'(Esy, Aar, ((Spa, Cpl), (((Bst, ((Aly, Ath), (Chi, Cru))), (((Hco, Hlo), Hla), Hse)), (Esa, Aal))))'
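If you would rather keep the branch lengths, or keep the closing semicolon so that the result is still a Newick string, the pattern can be narrowed. The two patterns below are my own variations rather than something taken from the answer above, and the small tree is only there to make the effect easy to see:

```python
import re

small_tree = "(Aaa@g1:0.1,(Bbb@g2:0.2,Ccc@g3:0.3):0.05):0.0;"

# Remove only the '@gene_id' part: stop at ':' so the branch lengths stay.
only_gene_ids = re.sub(r'@[^:,()]*', '', small_tree)

# Remove gene IDs and branch lengths, but stop before ';' so it is kept.
names_only = re.sub(r'[@:][^),;]*', '', small_tree)

print(only_gene_ids)  # (Aaa:0.1,(Bbb:0.2,Ccc:0.3):0.05):0.0;
print(names_only)     # (Aaa,(Bbb,Ccc));
```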


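Since you are trying to parse a phylogenetic tree, I strongly recommend letting BioPython do the heavy lifting for you.

With Bio.Phylo you can easily parse and display phylogenies. After that it is just a matter of iterating over all tree elements and splitting the names at the 'at' sign.

Because Phylo expects its input to be in a file, we use io.StringIO to create an in-memory file-like object. Getting the complete tree is then as easy as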

Phylo.read(io.StringIO(s), 'newick')

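To check that the parsed tree looks right, I print it once with print(tree).

Now we want to change all node names that contain '@'. With tree.find_elements we can visit every node. Some nodes have no name, and some names may not contain '@'. So, to be extra careful, we first check if n.name and '@' in n.name. Only then do we split the node's name at '@' and keep just the first part (index 0): n.name = n.name.split('@')[0]

To recreate the initial string representation, we use Phylo.write: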

out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())

Again, write wants a file argument - if we only want the string, we can use a StringIO object once more.
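In case io.StringIO is new to you, here is a small standard-library sketch of the two ways it is used here: wrapping a string so it can be read like a file, and collecting written text so it can be retrieved as a string. The short Newick string is just a placeholder:

```python
import io

newick = "(Aaa:0.1,Bbb:0.2);"

# Reading: wrap the string so it can be passed where a file handle is expected.
handle = io.StringIO(newick)
text_read = handle.read()

# Writing: collect what would normally go to a file, then get it back as a str.
out = io.StringIO()
out.write(text_read)
out.write("\n")
result = out.getvalue()

print(repr(result))  # '(Aaa:0.1,Bbb:0.2);\n'
```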

Full code:

import io

from Bio import Phylo

if __name__ == '__main__':
    s = '(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'

    tree = Phylo.read(io.StringIO(s), 'newick')
    print(' before '.center(20, '='))
    print(tree)

    for n in tree.find_elements():
        if n.name and '@' in n.name:
            n.name = n.name.split('@')[0]

    print(' result '.center(20, '='))
    out = io.StringIO()
    Phylo.write(tree, out, "newick")
    print(out.getvalue())

Output:

====== before ======
Tree(rooted=False, weight=1.0)
    Clade(branch_length=0.0129090235079)
        Clade(branch_length=0.0726396855636, name='Esy@ESY15_g64743_DN3_SP7_c0')
        Clade(branch_length=0.137507902808, name='Aar@AA_maker7399_1')
        Clade(branch_length=0.0129090235079)
            Clade(branch_length=9.05326020871e-05)
                Clade(branch_length=0.0318934795022, name='Spa@Tp2g18720')
                Clade(branch_length=0.0273465005242, name='Cpl@CP2_g48793_DN3_SP8_c')
            Clade(branch_length=0.00328120860999)
                Clade(branch_length=0.00859075940423)
                    Clade(branch_length=0.0340484449097)
                        Clade(branch_length=0.0332592496158, name='Bst@Bostr_13083s0053_1')
                        Clade(branch_length=0.0150356382287)
                            Clade(branch_length=0.0205924636564)
                                Clade(branch_length=0.0328569260951, name='Aly@AL8G21130_t1')
                                Clade(branch_length=0.0391706378372, name='Ath@AT5G48370_1')
                            Clade(branch_length=0.00998579652059)
                                Clade(branch_length=0.0954469923893, name='Chi@CARHR183840_1')
                                Clade(branch_length=0.0570981548016, name='Cru@Carubv10026342m')
                    Clade(branch_length=0.0372829371381)
                        Clade(branch_length=0.0206478928557)
                            Clade(branch_length=0.0144626717872)
                                Clade(branch_length=0.00823215335663, name='Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100')
                                Clade(branch_length=0.0085462978729, name='Hlo@DN13684_c0_g1_i1_p1')
                            Clade(branch_length=0.0225079453622, name='Hla@DN22821_c0_g1_i1_p1')
                        Clade(branch_length=0.048590776459, name='Hse@DN23412_c0_g1_i3_p1')
                Clade(branch_length=1.00000050003e-06)
                    Clade(branch_length=0.0378509854703, name='Esa@Thhalv10004228m')
                    Clade(branch_length=0.0712272454125, name='Aal@Aa_G102140_t1')

====== result ======
(Esy:0.07264,Aar:0.13751,((Spa:0.03189,Cpl:0.02735):0.00009,(((Bst:0.03326,((Aly:0.03286,Ath:0.03917):0.02059,(Chi:0.09545,Cru:0.05710):0.00999):0.01504):0.03405,(((Hco:0.00823,Hlo:0.00855):0.01446,Hla:0.02251):0.02065,Hse:0.04859):0.03728):0.00859,(Esa:0.03785,Aal:0.07123):0.00000):0.00328):0.01291):0.01291;

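Phylo's default format prints fewer digits than the original tree has. To keep the numbers unchanged, just override the branch length format string with "%s":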

Phylo.write(tree, out, "newick", format_branch_length="%s")
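To see what the format string changes, here is a plain-Python comparison. It assumes the default is a fixed five-decimal format such as "%1.5f", which is what the output above looks like; the helper name format_lengths is made up for this illustration:

```python
def format_lengths(values, fmt):
    """Apply a printf-style format string to each branch length."""
    return [fmt % value for value in values]


lengths = [0.0726396855636, 9.05326020871e-05, 1.00000050003e-06]

print(format_lengths(lengths, "%1.5f"))  # ['0.07264', '0.00009', '0.00000']
print(format_lengths(lengths, "%s"))     # ['0.0726396855636', '9.05326020871e-05', '1.00000050003e-06']
```

Keep in mind that the branch lengths are floats at this point, so "%s" prints Python's own representation of each float. That may still differ cosmetically from the input text, for example when the input has trailing zeros.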

If you need the parentheses in the output, try this regular expression:

import re
# '@gene_id', ':branch_length' and the final ';' are removed.
# The branch-length alternative is anchored on ':' so that letters in the
# species names (such as the 'e' in 'Hse') are left alone.
regex = r"@[\w.]+|:[0-9.e+-]+|;"
phylogenetic_tree = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"

print(re.sub(regex, "", phylogenetic_tree))

Output:

(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))

You can use a regular expression:

import re
s = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
p = r"...?(?=@)|\(|\)"

result = re.findall(p, s)

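The result is a list, so you can join it into a string or do whatever else you need with it.

What is going on here:
p is the regular expression pattern. In this pattern:
. matches any single character (not only letters)
...?(?=@) matches two or three characters that are immediately followed by @; the lookahead (?=@) checks for the @ without making it part of the match. Since every species ID is three letters long, this gives you the three characters in front of each @
| is the "or" operator, which I use here to look for further patterns
the rest, \( and \), finds the literal parentheses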
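Turning that list back into a tree-like string takes a little more than ''.join(result), because the commas are not part of the matches. One possible way, sketched on a small made-up tree, is to put a comma between two tokens unless the first one opens a group or the second one closes it; the name join_tokens is hypothetical:

```python
import re

p = r"...?(?=@)|\(|\)"


def join_tokens(tokens):
    """Rebuild a parenthesised string from species names and parentheses."""
    pieces = []
    for previous, current in zip([None] + tokens, tokens):
        # A comma separates siblings: not right after '(' and not right before ')'.
        if previous is not None and previous != "(" and current != ")":
            pieces.append(",")
        pieces.append(current)
    return "".join(pieces)


small_tree = "(Aaa@g1:0.1,(Bbb@g2:0.2,Ccc@g3:0.3):0.05):0.0;"
tokens = re.findall(p, small_tree)
print(tokens)               # ['(', 'Aaa', '(', 'Bbb', 'Ccc', ')', ')']
print(join_tokens(tokens))  # (Aaa,(Bbb,Ccc))
```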

Parsing code can be hard to understand. TatSu lets you write readable parsing code by combining a grammar with Python:
text = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"

import tatsu

# raw string, so that \w and \d reach the grammar unchanged
grammar = r"""
start = things ';'
    ;

things = thing [ ',' things ]
    ;

thing = x '@' y ':' number
    | '(' things ')' ':' number
    ;

x = /\w+/
    ;

y = /\w+/
    ;

number = /[+-]?\d+\.?\d*(e?[+-]?\d*)/
    ;
"""

class Semantics:
    def x(self, ast):
        # the method name matches the rule name
        print('X =', ast)
        # return the node so that it is kept in the resulting AST
        return ast

parser = tatsu.compile(grammar, semantics=Semantics())
parser.parse(text)
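If you do not want an extra dependency, the same grammar can also be followed by hand. The following is a rough standard-library sketch of a recursive-descent parser with one function per grammar rule. It is not TatSu's API, the name species_only is made up, and it assumes there is no whitespace in the input and that every leaf has the form x@y:number:

```python
import re

# Punctuation characters are single tokens; anything else (names, numbers) is one token.
TOKEN = re.compile(r"[(),;:@]|[^(),;:@]+")


def species_only(newick):
    """Parse 'start = things ;' and return the tree with species names only."""
    tokens = TOKEN.findall(newick)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r} at token {pos}, got {tok!r}")
        pos += 1
        return tok

    def things():
        # things = thing [ ',' things ]
        parts = [thing()]
        while peek() == ",":
            take(",")
            parts.append(thing())
        return ",".join(parts)

    def thing():
        if peek() == "(":
            # thing = '(' things ')' ':' number
            take("(")
            inner = things()
            take(")")
            take(":")
            take()  # branch length, discarded
            return "(" + inner + ")"
        # thing = x '@' y ':' number
        species = take()
        take("@")
        take()  # gene ID, discarded
        take(":")
        take()  # branch length, discarded
        return species

    result = things()
    take(";")
    return result


print(species_only("(Aaa@g1:0.1,(Bbb@g2:0.2,Ccc@g3:0.3):0.05):0.0;"))  # (Aaa,(Bbb,Ccc))
```

Unlike the regex approaches, this raises a ValueError when the input does not follow the grammar, for example when a closing parenthesis is missing.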