shlex.split() : How to keep quotes around sub-strings and not split by in-sub-string whitespace?

I want the input string "add [7,8,9+5,'io open'] 7&4 67" to be split like ['add', "[7,8,9+5,'io open']", '7&4', '67'], i.e. in-line strings must remain within their quotes and not be split at all, while whitespace-based splitting is required everywhere else, like so:

>>> import shlex
>>> shlex.split("add [7,8,9+5,\'io\ open\'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']

But, if possible, the user shouldn't have to use \ at all, at least not for the quotes, and ideally not for the in-string whitespace either.

What would such a break_down() function look like? I tried the following, but it doesn't handle in-string whitespace:

>>> import shlex
>>> def break_down(ln) :
...     ln = ln.replace("'","\'")
...     ln = ln.replace('"','\"')
...     # User will still have to escape in-string whitespace
...     return shlex.split(ln) # Note : Can't use posix=False; will split by in-string whitespace and has no escape seqs
...
>>> break_down("add [7,8,9+5,'io\ open'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']
>>> break_down("add [7,8,9+5,'io open'] 7&4 67")
['add', "[7,8,9+5,'io", "open']", '7&4', '67']

Maybe there is a better function/method/technique to do this; I'm not very experienced with the standard library yet. Or maybe I just need to write a custom split()?

Edit 1: Progress

>>> def break_down(ln) :
...     ln = r"{}".format(ln) # Note: a no-op at runtime; the r prefix only affects literals
...     ln = ln.replace("'",r"\'") # r"\'" is a real backslash followed by a quote
...     ln = ln.replace('"',r'\"')
...     return shlex.split(ln)

So now the user only has to use a single \ to escape any quotes/spaces etc., somewhat like they would in a shell. Seems to work.
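
For reference, this version should give the desired split on the example input (a quick check; the in-string space still needs its backslash, but the quotes no longer do, since the replace() calls escape them):

>>> break_down(r"add [7,8,9+5,'io\ open'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']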

I solved this by writing a custom lexing system (of sorts).

I decided to use re, since the code makes heavy use of re anyway, and, with the help of this reddit comment, the problem is solved:

import re

def lex(ln):
    ln = ln.split('#')[0] # Strip comments

    tkn_delims, relst = '\'\'""{}()[]', [] # Edit tkn_delims to add more delimiter pairs
    for i in range(0, len(tkn_delims), 2):
        # Regex for each delimiter pair: opener, any number of non-closers, closer
        relst.append(r'\{0}[^{1}]*\{1}'.format(tkn_delims[i], tkn_delims[i+1]))
    regex = '|'.join(relst) + r'|\S+' # Fall back to whitespace-delimited tokens

    return re.findall(regex, ln)
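
On the input from the top of the question this should give the splitting asked for, with the quotes kept and no escaping needed (a quick check):

>>> lex("add [7,8,9+5,'io open'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']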

Edit: Thanks to @furas's comment ("first thought: you can't use # in arguments ..."), the code has been edited to recognize the start of a comment only if # appears as the first character of a token. Thus:

  • <command> '#...' ['#...#'] lexes to ['<command>', "'#...'", "['#...#']"]
  • <command> '...' # does xyz and <command> '...' #does xyz both lex to ['<command>', "'...'"].

The edited lex():

import re

def lex(ln):
    ''' Lexing :
    1. Generate a regex for each token type :
       a) tokens that are Python sequence literals.
       b) tokens that are whitespace-delimited.
       There is only one 'layer' of lexing, i.e. for sequences within sequences, the entire outermost sequence is one token.
    2. Remove tokens that fall within comments.
    3. Return the list of tokens.
    '''

    token_delims = '\'\'""{}()[]'
    regex_subexpressions = []
    for i in range(0, len(token_delims), 2):
        regex_subexpressions.append(r'\{0}[^{1}]*\{1}'.format(token_delims[i], token_delims[i+1])) # Regex for each sequence delimiter pair
    regex = '|'.join(regex_subexpressions) + r'|\S+' # Combine with regex for whitespace delimitation on the remainder

    tokens = re.findall(regex, ln)

    # Everything from the first token that begins with '#' onwards is a comment.
    # (Removing items from a list while iterating over it skips elements, so
    # truncate the list instead.)
    for i, token in enumerate(tokens):
        if token[0] == '#':
            tokens = tokens[:i]
            break

    return tokens
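
A quick check of the comment handling against the two cases above:

>>> lex("<command> '#...' ['#...#']")
['<command>', "'#...'", "['#...#']"]
>>> lex("<command> '...' # does xyz")
['<command>', "'...'"]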