Working out what was matched by each part of a regex

Question

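I have a few hundred (fairly simple) regular expressions and their matches in a large number of sequences. I'd like to be able to tell which part of each regular expression matches which positions in the target sequence. For example, the regular expression "[DSTE][^P][^DEWHFYC]D[GSAN]" matches positions 4 to 8 in the following sequence: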

ABCSGADAZZZ

What I'd like to get (programmatically) is, for each regex, 1) each 'part' of the regex and 2) the positions in the target sequence that it matched:

[DSTE] -- (3, 4),
[^P] -- (4, 5),
[^DEWHFYC] -- (5, 6),
D -- (6, 7),
[GSAN] -- (7, 8)

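I found this website, which basically does what I want: https://regex101.com/, but I'm not sure how deep I would have to dive into regex parsing to be able to do this in my own code (I'm using Python and R).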

Answer 1

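If you want to extract the positions in the string matched by each part of the regex, you should wrap each part in () so that it becomes a capturing group. Without that, you cannot tell which positions each part of the regex matched.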

([DSTE])([^P])([^DEWHFYC])(D)([GSAN])

Now each part is separate, so the parts of the regex can themselves be extracted with another regex:

\((.*?)(?=\)(?:\(|$))

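Bonus: you can also extract the portion of the text matched by each part of the regex.

So, using the re.search(pattern, text, flags=0) method, you can get the desired data as follows: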

import re

text = 'ABCSGADAZZZ'
theRegex = r'([DSTE])([^P])([^DEWHFYC])(D)([GSAN])'

r1 = re.compile(r'\((.*?)(?=\)(?:\(|$))')  # each part extractor
r2 = re.compile(theRegex)                  # your regex
grps = r1.findall(theRegex)                # parts of regex

m = re.search(r2, text)

for i in range(len(grps)):
    print('Regex: {} | Match: {} | Range: {}'.format(grps[i], m.group(i+1), m.span(i+1)))


Answer 2

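I've never seen a regex engine that exposes this kind of functionality in its public API, or at least I'm not aware of such an API. Maybe there is one, but not necessarily in R or Python.

But in any case, it is not as simple as you might think.

Consider the regex /(a(b*))*/ against "abbabbb": the b* part matches more than one substring. Conversely, a single substring can be matched by several parts of a regex.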
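To make that example concrete, here is a small check with Python's standard `re` module (the variable name `m` is only illustrative). When a capturing group sits inside a repetition, the match object keeps only the last substring that group matched, so the earlier "bb" is not reported:

```python
import re

# The inner group (b*) takes part in two iterations of the outer group,
# but a match object from `re` only remembers the last iteration.
m = re.fullmatch(r'(a(b*))*', 'abbabbb')

print(m.span(0))               # (0, 7)       whole match
print(m.group(1), m.span(1))   # abbb (3, 7)  last iteration of (a(b*))
print(m.group(2), m.span(2))   # bbb (4, 7)   last iteration of (b*)
```

If installing a third-party package is an option, the `regex` module on PyPI keeps every repetition and exposes them through `captures()` and `spans()` on its match objects.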

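Even if your regexes are "fairly simple"... are they all really as simple as the one in the question?

As others have already mentioned, you can use capturing groups to find out which group matched what, but for that you need to edit the regexes yourself and keep track of the group indices. Or, yes, write your own parser, because regexes cannot parse regexes - they are not powerful enough for their own grammar.

...Well, maybe there is a way to parse and modify all of your regexes automatically and easily (to add the capturing groups), if they really are simple and more or less uniform. But with only one example of your regexes, it is impossible to tell.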
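As a rough sketch of what "parse and modify automatically" could look like, assuming the regexes really are flat (only literals, `.`, escapes, character classes and the quantifiers `?`, `*`, `+`, `{m,n}`): the hand-written scanner below splits a pattern into top-level parts and refuses anything it does not understand instead of guessing. The names `split_parts` and `part_spans` are hypothetical and not taken from any library.

```python
import re

def split_parts(pattern):
    """Split a flat regex into top-level parts (atom plus optional quantifier).

    Groups, alternation, anchors and lazy quantifiers are rejected with
    ValueError rather than handled.
    """
    parts, i, n = [], 0, len(pattern)
    while i < n:
        start, c = i, pattern[i]
        if c in '()|^$?*+{':
            raise ValueError('unsupported construct {!r} at index {}'.format(c, i))
        if c == '\\':                       # escape such as \d or \.
            if i + 1 >= n:
                raise ValueError('dangling backslash')
            i += 2
        elif c == '[':                      # character class
            i += 1
            if i < n and pattern[i] == '^':
                i += 1
            if i < n and pattern[i] == ']':  # a leading ']' is a literal
                i += 1
            while i < n and pattern[i] != ']':
                i += 2 if pattern[i] == '\\' else 1
            if i >= n:
                raise ValueError('unterminated character class')
            i += 1                          # consume the closing ']'
        else:                               # plain literal or '.'
            i += 1
        # optional quantifier directly after the atom
        if i < n and pattern[i] in '?*+':
            i += 1
        elif i < n and pattern[i] == '{':
            end = pattern.find('}', i)
            if end == -1:
                raise ValueError('unterminated quantifier')
            i = end + 1
        parts.append(pattern[start:i])
    return parts

def part_spans(pattern, text):
    """Return [(part, (start, end)), ...] for the first match, or None."""
    parts = split_parts(pattern)
    grouped = re.compile(''.join('({})'.format(p) for p in parts))
    m = grouped.search(text)
    if m is None:
        return None
    return [(p, m.span(k)) for k, p in enumerate(parts, start=1)]

print(part_spans('[DSTE][^P][^DEWHFYC]D[GSAN]', 'ABCSGADAZZZ'))
# [('[DSTE]', (3, 4)), ('[^P]', (4, 5)), ('[^DEWHFYC]', (5, 6)), ('D', (6, 7)), ('[GSAN]', (7, 8))]
```

Raising on unknown syntax is a deliberate choice here: with a few hundred patterns, it is safer to get a list of the ones that need manual attention than silently wrong spans.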

...But you may be asking the wrong question: https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/

One trap that many posters fall into is to ask how to achieve some “small” aim, but never say what the larger aim is. Often the smaller aim is either impossible or rarely a good idea – instead, a different approach is needed

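Update:

I changed your example string and regex slightly, to cover the P{1,3} case you mentioned in the comments.

Here is the code to modify the regexes and get the desired output: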

import re

orig_re = "[DSTE]{1,1}[^P][^DEWHFYC]DP{1,3}[GSAN]"
# one part = a character class or a single character, plus an optional {...} quantifier
mod_re = r'((\[.*?\]|.)(\{.*?\})?)'
groups = re.findall(mod_re, orig_re)
print("parts of regex:", [g[0] for g in groups])
# wrap every part in its own capturing group
new_regex_str = re.sub(mod_re, r'(\1)', orig_re)
print("new regex with capturing groups:", new_regex_str)
new_re = re.compile(new_regex_str)
text = "ABCSGADPPAZZZSGADPA"
matches = new_re.finditer(text)
for m in matches:
    print('----------')
    for g in range(len(groups)):
        print('#{}: {} -- {}'.format(g, groups[g][0], m.span(g+1)))

It will give you:

parts of regex: ['[DSTE]{1,1}', '[^P]', '[^DEWHFYC]', 'D', 'P{1,3}', '[GSAN]']
new regex with capturing groups: ([DSTE]{1,1})([^P])([^DEWHFYC])(D)(P{1,3})([GSAN])
----------
#0: [DSTE]{1,1} -- (3, 4)
#1: [^P] -- (4, 5)
#2: [^DEWHFYC] -- (5, 6)
#3: D -- (6, 7)
#4: P{1,3} -- (7, 9)
#5: [GSAN] -- (9, 10)
----------
#0: [DSTE]{1,1} -- (13, 14)
#1: [^P] -- (14, 15)
#2: [^DEWHFYC] -- (15, 16)
#3: D -- (16, 17)
#4: P{1,3} -- (17, 18)
#5: [GSAN] -- (18, 19)

The same in JS:

const orig_re = "[DSTE]{1,1}[^P][^DEWHFYC]DP{1,3}[GSAN]"
const mod_re = /((\[.*?\]|.)(\{.*?\})?)/g
const groups = [...orig_re.matchAll(mod_re)].map(g => g[0])
console.log("parts of regex:", groups)
// wrap every part in its own capturing group
const new_re = orig_re.replace(mod_re, "($1)")
console.log("new regex with capturing groups:", new_re)
const str = "ABCSGADPPAZZZSGADPA"
const matches = str.matchAll(new_re)
for (const m of matches) {
    console.log('----------')
    // group spans are rebuilt from the match index and each group's length
    let pos = m.index
    groups.forEach((g, i) => console.log(`#${i}: ${g} -- (${pos},${pos += m[i+1].length})`))
}

Answer 3

Using the stringr package, you should be able to put something together like this:

> stringr::str_match_all(string = "ABCSGADAZZZ",
                         pattern = "[DSTE][^P][^DEWHFYC]D[GSAN]")
[[1]]
     [,1]   
[1,] "SGADA"
> stringr::str_locate_all(string = "ABCSGADAZZZ",
                          pattern = "[DSTE][^P][^DEWHFYC]D[GSAN]")
[[1]]
     start end
[1,]     4   8

Then combine the function outputs, or write a simple wrapper function.
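The stringr calls above report only the location of the whole match (4 to 8), so a wrapper still has to supply the per-part positions. As a rough sketch of one possible shape for such a wrapper, written in Python rather than R: it assumes the caller already has the regex split into parts, wraps each part in a capturing group, and reports 1-based inclusive positions in the style of the start/end columns shown above. The function name `locate_parts` and its return shape are made up for illustration.

```python
import re

def locate_parts(parts, text):
    """Return one list of (part, start, end) rows per match.

    `parts` is a list of regex fragments that, joined together, form the
    full pattern. Positions are 1-based and inclusive; a part that matched
    the empty string therefore has end == start - 1.
    """
    pattern = re.compile(''.join('({})'.format(p) for p in parts))
    results = []
    for m in pattern.finditer(text):
        rows = []
        for k, part in enumerate(parts, start=1):
            start, end = m.span(k)           # 0-based, end-exclusive
            rows.append((part, start + 1, end))
        results.append(rows)
    return results

parts = ['[DSTE]', '[^P]', '[^DEWHFYC]', 'D', '[GSAN]']
for rows in locate_parts(parts, 'ABCSGADAZZZ'):
    for part, start, end in rows:
        print(part, start, end)
# [DSTE] 4 4
# [^P] 5 5
# [^DEWHFYC] 6 6
# D 7 7
# [GSAN] 8 8
```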

Answer 4

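It's still not 100%, but it returns output for 3365 of the 3510 entries in my dataset. The few I checked lined up :)

My github (links below) contains the input and the output, in csv and txt format respectively.

Please ignore the global variables; I was considering restructuring the code to see whether there was a noticeable speed improvement, but didn't get around to it.

Currently this version has problems with the order of operations for alternation and the start/end-of-line operators (^ $) when they are alternation options at the beginning or end of the string. I'm fairly confident this has to do with precedence, but I wasn't able to get it organized well enough.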
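The precedence rule in question can be illustrated with the standard `re` module: alternation binds more loosely than concatenation and anchors, so `^a|b$` means `(^a)|(b$)` rather than `^(a|b)$`. This check only illustrates the rule and is not part of the notebook; the names `loose` and `tight` are arbitrary.

```python
import re

# '|' has the lowest precedence, so each anchor belongs to one alternative only.
loose = re.compile(r'^a|b$')       # parsed as (^a)|(b$)
tight = re.compile(r'^(?:a|b)$')   # anchors apply to both alternatives

print(bool(loose.search('ab')))    # True: '^a' matches at the start
print(bool(tight.search('ab')))    # False: the whole string is neither 'a' nor 'b'
print(bool(loose.search('xb')))    # True: 'b$' matches at the end
```

A splitter that separates alternatives first and then treats `^` and `$` as belonging to the alternative they appear in would follow the same rule.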

The function call is in the last cell of the notebook. Instead of running the entire DataFrame:

for x in range(len(df)):
    try:
        df_expression = df.iloc[x, 2]
        df_subsequence = df.iloc[x, 1]

        # call function
        identify_submatches(df_expression, df_subsequence)
        print(dataframe_counting)
        dataframe_counting += 1
    except:
        pass

you can easily test one pattern at a time by passing the pattern and the corresponding sequence to the function:

p = ''
s = ''

identify_submatches(p, s)

Code: https://github.com/jameshollisandrew/just_for_fun/blob/master/motif_matching/motif_matching_02.ipynb

Input: https://github.com/jameshollisandrew/just_for_fun/blob/master/motif_matching/elm_compiled_ss_re.csv

Output: https://github.com/jameshollisandrew/just_for_fun/blob/master/motif_matching/motif_matching_02_outputs.txt

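import regex as r  # assumed import (not shown in the post): (?R) in m_gro needs the third-party regex module


def identify_submatches(exp_a, sub_a):
    """exp_a as input expression
       sub_a as input subject string"""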

    input_exp = exp_a
    input_sub = sub_a

    m_gro = r'\^*\((?:[^()]+|(?R))*+\)({.+?})*$*'
    m_set = r'\^*\[.+?\]({.+?})*$*'
    m_alt = r'\|'
    m_lit = r'\^*[.\w]({.+?})*$*|$'


    # PRINTOUT
    if (print_type == 1):
        print('\nExpression Input: {}\nSequence Input: {}'.format(exp_a, sub_a))

    if (print_type == 3):
        print('\n\nSTART ITERATION\nINPUTS\n  exp: {}\n  seq: {}'.format(exp_a, sub_a))


    # return the pattern match (USE IF SUB IS NOT MATCHED PRIMARY)
    m = r.search(exp_a, sub_a)
    if m is not None:
        sub_a = m.group()
        # >>>PRINTOUT<<<
        if print_type == 3:
            print('\nSEQUENCE TYPE M\n  exp: {}\n  seq: {}'.format(exp_a, sub_a))
    else:
        print('Search expression: {} in sequence: {} returned no matches.\n\n'.format(exp_a, sub_a))
        return None

    if (print_type == 1):
        print('Subequence Match: {}'.format(sub_a))



    # check if main expression has unnested alternation
    if len(alt_states(exp_a)) > 0:
        # returns matching alternative
        exp_a = alt_evaluation(exp_a, sub_a)

        # >>>PRINTOUT<<<
        if print_type == 3:
            print('\nALTERNATION RETURN\n  exp: {}\n  seq: {}'.format(exp_a, sub_a))


    # get initial expression list
    exp_list = get_states(exp_a)


    # count possible expression constructions
    status, matched_tuples = finite_state(exp_list, sub_a)

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nCONFIRM EXPRESSION\n  exp: {}'.format(matched_tuples))


    # index matches
    indexer(input_exp, input_sub, matched_tuples)


def indexer(exp_a, sub_a, matched_tuples):

    sub_length = len(sub_a)
    sub_b = r.search(exp_a, sub_a)
    adj = sub_b.start()
    sub_b = sub_b.group()

    print('')

    for pair in matched_tuples:
        pattern, match = pair

        start = adj
        adj = adj + len(match)
        end = adj
        index_pos = (start, end)

        sub_b = slice_string(match, sub_b)
        print('\t{}\t{}'.format(pattern, index_pos))

def strip_nest(s):

    s = s[1:]
    s = s[:-1]

    return s

def slice_string(p, s):

    pat = p
    string = s

    # handles escapes
    p = r.escape(p)
    # slice the input string on input pattern
    s = r.split(pattern = p, string = s, maxsplit = 1)[1]


    # >>>PRINTOUT<<<
    if print_type == 4:
        print('\nSLICE STRING\n  pat: {}\n  str: {}\n  slice: {}'.format(pat, string, s))


    return s

def alt_states(exp):
    # check each character in string
    idx = 0 # index tracker
    op = 0 # open parenth
    cp = 0 # close parenth
    free_alt = [] # amend with index position of unnested alt

    for c in exp:
        if c == '(':
            op += 1
        elif c == ')':
            cp += 1
        elif c == '|':
            if op == cp:
                free_alt.append(idx)
        if idx < len(exp)-1:
            idx+=1

    # split string if found
    alts = []

    if free_alt:
        _ = 0
        for i in free_alt:
            alts.append(exp[_:i])
            alts.append(exp[i+1:])

    # the truth value of this check can be checked against the length of the return
    # len(free_alt) > 0 means unnested "|" found
    return alts


def alt_evaluation(exp, sub):

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nALTERNATION SELECTION\n  EXP: {}\n  SEQ: {}'.format(exp, sub))

    # gets alt index position
    alts = alt_states(exp)

    # variables for eval
    a_len = 0 # length of alternate match
    keep_len = 0 # length of return match
    keep = '' # return match string

    # evaluate alternatives
    for alt in alts:
        m = r.search(alt, sub)
        if m is not None:
            a_len = len(m.group())                             # length of match string

            # >>>PRINTOUT<<<
            if print_type == 3:
                print('  pat: {}\n  str: {}\n  len: {}'.format(alt, m.group(0), len(m.group(0))))

            if a_len >= keep_len:                              
                keep_len = a_len                               # sets alternate length to keep length
                exp = alt                                     # sets alt as keep variable

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('  OUT: {}'.format(exp))                

    return exp

def get_states(exp):
    """counts number of subexpressions to be checked
       creates FSM"""

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\nGET STATES\n  EXP: {}'.format(exp))

    # List of possible subexpression regex matches
    m_gro = r'\^*\((?:[^()]+|(?R))*+\)({.+?})*$*'
    m_set = r'\^*\[.+?\]({.+?})*$*'
    m_alt = r'\|'
    m_lit = r'\^*[.\w]({.+?})*$*|$'


    # initialize capture list
    exp_list = []

    # loop through first level of subexpressions: 
    while exp != '':

        if r.match(m_gro, exp):
            _ = r.match(m_gro, exp).group(0)
            exp_list.append(_)
            exp = slice_string(_, exp)

        elif r.match(m_set, exp):
            _ = r.match(m_set, exp).group(0)
            exp_list.append(_)
            exp = slice_string(_, exp)



        elif r.match(m_alt, exp):
            _ = ''

        elif r.match(m_lit, exp):
            _ = r.match(m_lit, exp).group(0)
            exp_list.append(_)
            exp = slice_string(_, exp)

        else:
            print('ERROR getting states')
            break

    n_states = len(exp_list)


    # >>>PRINTOUT<<<
    if print_type == 3:
        print('GET STATES OUT\n  states:\n  {}\n  # of states: {}'.format(exp_list, n_states))


    return exp_list


def finite_state(exp_list, seq, level = 0, pattern_builder = '', iter_count = 0, pat_match = [], seq_match = []):


    # >>>PRINTOUT<<<
    if (print_type == 3):
        print('\nSTARTING MACHINE\n  EXP: {}\n  SEQ: {}\n  LEVEL: {}\n  matched: {}\n  pat_match: {}'.format(exp_list, seq, level, pattern_builder, pat_match))


    # patterns
    m_gro = r'\^*\((?:[^()]+|(?R))*+\)({.+?})*$*'
    m_set = r'\^*\[.+?\]({.+?})*$*'
    m_alt = r'\|'
    m_squ = r'\{(.),(.)\}'
    m_lit = r'\^*[.\w]({.+?})*$*|$'


    # set state, n_state
    state = 0
    n_states = len(exp_list)
    #save_state = []
    #save_expression = []


    # temp exp
    local_seq = seq

    # >>>PRINTOUT<<<
    if print_type == 3:
        print('\n  >>>MACHINE START')



    # set failure cap so no endless loop
    failure_cap = 1000

    # since len(exp_list) returns + 1 over iteration (0 index) use the last 'state' as success state
    while state != n_states:

        for exp in exp_list:

            # iterations
            iter_count+=1

            # >>>PRINTOUT<<<
            if print_type == 3:
                print('  iteration count: {}'.format(iter_count))

            # >>>PRINTOUT<<<
            if print_type == 3:
                print('\n  evaluating: {}\n  for string: {}'.format(exp, local_seq))


            # alternation reset
            if len(alt_states(exp)) > 0:

                # get operand options
                operands = alt_states(exp)               
                # create temporary exp list
                temp_list = exp_list[state+1:]
                # add level
                level = level + 1                   

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  ALT MATCH: {}\n  state: {}\n  opers returned: {}\n  level in: {}'.format(exp, state, operands, level))

                # compile local altneration
                for oper in operands:
                    # get substates
                    _ = get_states(oper)
                    # compile list
                    oper_list = _ + temp_list
                    # send to finite_state, sublevel                    
                    alt_status, pats = finite_state(oper_list, local_seq, level = level, pattern_builder=pattern_builder, iter_count=iter_count, pat_match=pat_match)
                    if alt_status == 'success':
                        return alt_status, pats


            # group cycle
            elif r.match(m_gro, exp) is not None:
                # get operand options
                operands = group_states(exp)
                # create temporary exp list
                temp_list = exp_list[state+1:]
                # add level
                level = level + 1

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  GROUP MATCH: {}\n  state: {}\n  opers returned: {}\n  level in: {}'.format(exp, state, operands, level))

                # compile local
                oper_list = operands + temp_list
                # send to finite_state, sublevel
                group_status, pats = finite_state(oper_list, local_seq, level=level, pattern_builder=pattern_builder, iter_count=iter_count, pat_match=pat_match)
                if group_status == 'success':
                    return group_status, pats


            # quantifier reset
            elif r.search(m_squ, exp) is not None:
                # get operand options
                operands = quant_states(exp)
                # create temporary exp list
                temp_list = exp_list[state+1:]
                # add level
                level = level + 1

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  QUANT MATCH: {}\n  state: {}\n  opers returned: {}\n  level in: {}'.format(exp, state, operands, level))

                # compile local
                for oper in reversed(operands):
                    # compile list
                    oper_list = [oper] + temp_list
                    # send to finite_state, sublevel
                    quant_status, pats = finite_state(oper_list, local_seq, level=level, pattern_builder=pattern_builder, iter_count=iter_count, pat_match=pat_match)
                    if quant_status == 'success':
                        return quant_status, pats


            # record literal
            elif r.match(exp, local_seq) is not None:
                # add to local pattern
                m = r.match(exp, local_seq).group(0)
                local_seq = slice_string(m, local_seq)

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  state transition: {}\n  state {} ==> {} of {}'.format(exp, state, state+1, n_states))

                # iterate state for match
                pattern_builder = pattern_builder + exp
                pat_match = pat_match + [(exp, m)]
                state += 1
            elif r.match(exp, local_seq) is None:
                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  Return FAIL on {}, level: {}, state: {}'.format(exp, level, state))
                status = 'fail'
                return status, pattern_builder


            # machine success
            if state == n_states:

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('  MACHINE SUCCESS\n  level: {}\n  state: {}\n  exp: {}'.format(level, state, pattern_builder))

                status = 'success'
                return status, pat_match

            # timeout
            if iter_count == failure_cap:
                state = n_states

                # >>>PRINTOUT<<<
                if print_type == 3:
                    print('===============\nFAILURE CAP MET\n  level: {}\n  exp state: {}\n==============='.format(level, state))
                break

def group_states(exp):


    # patterns
    m_gro = r'\^*\((?:[^()]+|(?R))*+\)({.+?})*$*'
    m_set = r'\^*\[.+?\]({.+?})*$*'
    m_alt = r'\|'
    m_squ = r'\{(.),(.)\}'
    m_lit = r'\^*[.\w]({.+?})*$*'

    ret_list = []

    # iterate over groups
    groups = r.finditer(m_gro, exp)

    for gr in groups:
        _ = strip_nest(gr.group())      

        # alternation reset
        if r.search(m_alt, _):
            ret_list.append(_)

        else:
            _ = get_states(_)
            for thing in _:
                ret_list.append(thing)

    return(ret_list)

def quant_states(exp):


    # >>>PRINTOUT<<<
    if print_type == 4:
        print('\nGET QUANT STATES\n  EXP: {}'.format(exp))

    squ_opr = r'(.+)\{.,.\}'
    m_squ = r'\{(.),(.)\}'

    # create states
    states_list = []    

    # get operand
    operand_obj = r.finditer(squ_opr, exp)
    for match in operand_obj:
        operand = match.group(1)

    # get repetitions
    fa = r.findall(m_squ, exp)
    for m, n in fa:
        # loop through range
        for x in range(int(m), (int(n)+1)):
            # construct string
            _ = operand + '{' + str(x) + '}'
            # append to list
            states_list.append(_)

    # >>>PRINTOUT<<<
    if print_type == 4:
        print('  QUANT OUT: {}\n'.format(states_list))

    return states_list

%%time

print_type = 1
"""0:    
   1: includes input
   2: 
   3: all output prints on """


dataframe_counting = 0
for x in range(len(df)):

    try:
        df_expression = df.iloc[x, 2]
        df_subsequence = df.iloc[x, 1]

        # call function
        identify_submatches(df_expression, df_subsequence)
        print(dataframe_counting)
        dataframe_counting += 1
    except:
        pass

Example output

The output values (i.e. the subexpression and its index pair) are tab-separated.

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: TRQARRNRRRRWRERQRQIH
Subequence Match: RRRRWR

    [KR]{1} (7, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2270

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: TASQRRNRRRRWKRRGLQIL
Subequence Match: RRRRWK

    [KR]{1} (7, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2271

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: TRKARRNRRRRWRARQKQIS
Subequence Match: RRRRWR

    [KR]{1} (7, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2272

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: LDFPSKKRKRSRWNQDTMEQ
Subequence Match: KKRKRSRWN

    [KR]{4} (5, 9)
    [KR]    (9, 10)
    .   (10, 11)
    [KR]    (11, 12)
    W   (12, 13)
    .   (13, 14)
2273

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: ASQPPSKRKRRWDQTADQTP
Subequence Match: KRKRRWD

    [KR]{2} (6, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2274

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: GGATSSARKNRWDETPKTER
Subequence Match: RKNRWD

    [KR]{1} (7, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2275

Expression Input: [KR]{1,4}[KR].[KR]W.
Sequence Input: PTPGASKRKSRWDETPASQM
Subequence Match: KRKSRWD

    [KR]{2} (6, 8)
    [KR]    (8, 9)
    .   (9, 10)
    [KR]    (10, 11)
    W   (11, 12)
    .   (12, 13)
2276

Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: LLNAATALSGSMQYLLNYVN
Subequence Match: LLNAATALSGSMQYLLNYV

    [VMILF] (0, 1)
    [MILVFYHPA] (1, 2)
    [^P]    (2, 3)
    [TASKHCV]   (3, 4)
    [AVSC]  (4, 5)
    [^P]    (5, 6)
    [^P]    (6, 7)
    [ILVMT] (7, 8)
    [^P]    (8, 9)
    [^P]    (9, 10)
    [^P]    (10, 11)
    [LMTVI] (11, 12)
    [^P]    (12, 13)
    [^P]    (13, 14)
    [LMVCT] (14, 15)
    [ILVMCA]    (15, 16)
    [^P]    (16, 17)
    [^P]    (17, 18)
    [AIVLMTC]   (18, 19)
2277

Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: IFEASKKVTNSLSNLISLIG
Subequence Match: IFEASKKVTNSLSNLISLI

    [VMILF] (0, 1)
    [MILVFYHPA] (1, 2)
    [^P]    (2, 3)
    [TASKHCV]   (3, 4)
    [AVSC]  (4, 5)
    [^P]    (5, 6)
    [^P]    (6, 7)
    [ILVMT] (7, 8)
    [^P]    (8, 9)
    [^P]    (9, 10)
    [^P]    (10, 11)
    [LMTVI] (11, 12)
    [^P]    (12, 13)
    [^P]    (13, 14)
    [LMVCT] (14, 15)
    [ILVMCA]    (15, 16)
    [^P]    (16, 17)
    [^P]    (17, 18)
    [AIVLMTC]   (18, 19)
2278

Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: IYEKAKEVSSALSKVLSKID
Subequence Match: IYEKAKEVSSALSKVLSKI

    [VMILF] (0, 1)
    [MILVFYHPA] (1, 2)
    [^P]    (2, 3)
    [TASKHCV]   (3, 4)
    [AVSC]  (4, 5)
    [^P]    (5, 6)
    [^P]    (6, 7)
    [ILVMT] (7, 8)
    [^P]    (8, 9)
    [^P]    (9, 10)
    [^P]    (10, 11)
    [LMTVI] (11, 12)
    [^P]    (12, 13)
    [^P]    (13, 14)
    [LMVCT] (14, 15)
    [ILVMCA]    (15, 16)
    [^P]    (16, 17)
    [^P]    (17, 18)
    [AIVLMTC]   (18, 19)
2279

Expression Input: [VMILF][MILVFYHPA][^P][TASKHCV][AVSC][^P][^P][ILVMT][^P][^P][^P][LMTVI][^P][^P][LMVCT][ILVMCA][^P][^P][AIVLMTC]
Sequence Input: IYKAAKDVTTSLSKVLKNIN
Subequence Match: IYKAAKDVTTSLSKVLKNI

    [VMILF] (0, 1)
    [MILVFYHPA] (1, 2)
    [^P]    (2, 3)
    [TASKHCV]   (3, 4)
    [AVSC]  (4, 5)
    [^P]    (5, 6)
    [^P]    (6, 7)
    [ILVMT] (7, 8)
    [^P]    (8, 9)
    [^P]    (9, 10)
    [^P]    (10, 11)
    [LMTVI] (11, 12)
    [^P]    (12, 13)
    [^P]    (13, 14)
    [LMVCT] (14, 15)
    [ILVMCA]    (15, 16)
    [^P]    (16, 17)
    [^P]    (17, 18)
    [AIVLMTC]   (18, 19)
2280
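Records in the format shown above can be read back into Python with a few lines. This is a minimal sketch that assumes each record line is a subexpression without embedded whitespace, followed by whitespace (a tab in the original output) and a "(start, end)" pair; the names `RECORD` and `parse_records` are hypothetical, and the sample lines are copied from the output above.

```python
import re

# One record per line: subexpression, whitespace, then "(start, end)".
RECORD = re.compile(r'^\s*(\S+)\s+\((\d+), (\d+)\)\s*$')

def parse_records(lines):
    """Return (subexpression, start, end) tuples, skipping non-record lines."""
    rows = []
    for line in lines:
        m = RECORD.match(line)
        if m:
            rows.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return rows

sample = [
    'Expression Input: [KR]{1,4}[KR].[KR]W.',
    '\t[KR]{4}\t(5, 9)',
    '\t[KR]\t(9, 10)',
    '\t.\t(10, 11)',
    '2273',
]
print(parse_records(sample))
# [('[KR]{4}', 5, 9), ('[KR]', 9, 10), ('.', 10, 11)]
```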

Data from: ELM (the Eukaryotic Linear Motif resource for Functional Sites in Proteins) 2020. Retrieved from http://elm.eu.org/searchdb.html

python

regex

grammar

text-parsing