Parsing a string with certain keywords to split on (outside of string literals), but not splitting inside string literals, in Python

Question

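May I ask about a problem I've been stuck on for the past few days? I'd really appreciate it if you could help me solve it :)

So, I have a simple string that I want to parse using the '@' keyword (splitting only when '@' is outside a string literal, not when it is inside one). The reason is that I'm trying to learn how to parse/split strings on certain keywords, because I'm trying to implement my own 'simple programming language'...

Here is an example I made using regex (spaces after the '@' keyword don't matter):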

# Ignore the 'println(' thing, it's basically a builtin print statement that I made, so
# you can only focus on the string itself :)

# (?!\B"[^"]*)@(?![^"]*"\B)
# While looking up how to do this with regex, I found this pattern, which
# splits the string into elements on the '@' keyword, but does not split if
# '@' is found inside a string literal. Here's what I mean:

# '"user@mail.domain"'     --- found '@' inside a string, so don't parse it
# '"user@mail.domain" @ x' --- found '@' outside a string, so after being parsed would be like this:
# ['"user@mail.domain"', 'x']
print_args = re.split(r'(?!\B"[^"]*)@(?![^"]*"\B)', codes[x].split('println(')[-1].removesuffix(')\n' or ')'))
vars: list[str] = []
result_for_testing: list[str] = []
            
for arg in range(0, len(print_args)):
    # I don't know if this works, because it splits the string on every space.
    # Spaces inside a string literal get treated as spaces to split on, but
    # they should not be split, because they are inside the string literal,
    # not outside it.

    # Example 1: '"Hello, World!" @   x @     y' => ['"Hello, World!"', x, y]
    # Example 2: '"Hello,      World!      " @    x @   y' => ['"Hello,      World!      "', x, y]
    # At this point, the parsing doesn't have to worry about unnecessary spaces inside a string, just like the example 2 is...
    compare: list[str] = print_args[arg].split()

    # This checks whether '"' is absent from the parsed piece (in that case
    # it is a bare word without '"'). Otherwise, append the whole thing,
    # joining the rest of the compared elements
    
    # Here's the string: '"Value of a is: " @ a @ "String"' [for example 1]
    # Example 1: ['"Value of a is: "', 'a', '"String"'] (This one is correct)

    # Here's the string: '"   Value of a is: " @ a @ "   String"'
    # Example 2: ['" Value of a is: " @ a @ " String"'] (This one is incorrect)
    vars.append(compare[0]) if '"' not in compare[0] else vars.append(" ".join(compare[0:]))
    
    for v in range(0, len(vars)):
        # This just copies the elements of 'vars' into
        # 'result_for_testing'
        result_for_testing.append(vars[v])

print(result_for_testing)

After these operations, the output I get for the basic case is this:

string_to_be_parsed: str = '"Value of a is: " @ a @ "String"'
Output > ['"Value of a is: "', 'a', '"String"'] # As expected...

But it somehow breaks when the input looks like this (with unnecessary spaces):

string_to_be_parsed: str = '"   Value    of  a  is:     "    @     a   @  "   String  "'
Output > ['" Value of a is: " @ a @ " String "']
# Incorrect result and I'm hoping the result will be like this:

Expected Output > ["   Value    of  a  is:     ", a, "   String  "]
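# Spaces inside a string literal should simply be left alone, but I don't know how to do that

Alright everyone, that's the problem I'm facing. To sum up:

How do I parse a string and split it on the '@' keyword, without splitting when '@' is found inside a string literal within that string?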

Example: '"@ in a string inside a string" @ is_out_from_a_string'
The result should be: ['"@ in a string inside a string"', is_out_from_a_string]

When parsing the string, how do I leave all the spaces inside a string literal untouched?

Example: '"    unnecessary      spaces  here      too" @ x @ y @ z @ "   this   one     too"'
The result should be: ['"    unnecessary      spaces  here      too"', x, y, z, '"   this   one     too"']

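Thanks again for your hard work in helping me find a solution. If I did something wrong or misunderstood something, please tell me where and how I should fix it :)

Thanks :)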

Answer 1

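When it comes to programming languages, string.split() and nested loops are not enough. Language implementations usually break the work into two steps: a tokenizer (or lexer) and a parser. The tokenizer takes the input string (code in your language) and returns a list of tokens representing keywords, identifiers, and so on. In your code, that corresponds to each element of the result.

Either way, you will probably want to restructure your code a bit. For the tokenizer, here is a Python-ish sketch: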

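import re

WHITESPACE = re.compile(r"\s+")

def tokenize(yourcode, token_regexes):
    # token_regexes: list of (compiled regex, token type) pairs, tried in order
    tokens = []
    cursor = 0
    while cursor < len(yourcode):
        space = WHITESPACE.match(yourcode, cursor)
        if space:
            cursor = space.end()  # skip whitespace between tokens
            continue
        for regex, token_type in token_regexes:
            # match at the cursor instead of re-slicing the input each time
            match = regex.match(yourcode, cursor)
            if match and match.end() > cursor:
                tokens.append((token_type, match.group()))
                cursor = match.end()  # advance past the matched token
                break
        else:  # no regex matched at this position
            raise SyntaxError(f"invalid token at position {cursor}")
    return tokens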
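This uses a cursor to advance through your input string and extract tokens, which is the direct answer to your question. For the list of regexes, simply use a list of pairs, where each pair contains a regex and a string describing the token type.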
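As a rough illustration of the pairs idea, the pairs can also be merged into one pattern with named groups, so that `re` reports which token type matched. The token names (`STRING`, `AT`, `NAME`, `SKIP`, `MISMATCH`) and their patterns are my own assumptions about your language, not something from your code, and string escapes are not handled:

```python
import re

# Hypothetical token types for the question's mini-language. Order matters:
# the first alternative that matches at a position wins.
TOKEN_SPECS = [
    ("STRING", r'"[^"]*"'),       # double-quoted literal, no escape handling
    ("AT", r"@"),                 # the separator keyword
    ("NAME", r"[A-Za-z_]\w*"),    # identifiers such as x or a
    ("SKIP", r"\s+"),             # whitespace between tokens
    ("MISMATCH", r"."),           # any other character is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPECS))


def tokenize(source):
    """Return a list of (token_type, text) pairs for source."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup  # name of the group that matched
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected {match.group()!r} at position {match.start()}")
        tokens.append((kind, match.group()))
    return tokens


print(tokenize('"   Value    of  a  is:     "    @     a   @  "   String  "'))
```

Because the whole string literal is consumed as one `STRING` token, any '@' or run of spaces inside it never gets a chance to be treated as a separator.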

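However, for a first programming language project, building the tokenizer and parser by hand may not be the way to go, because it gets very complex very quickly, although that changes once you are comfortable with the basics. I would consider using a parser generator. I have used a Python one called SLY, as well as PLY (SLY's predecessor), with good results. A parser generator takes a grammar that describes your language in a specific format and outputs a program that can parse your language, so you can worry about the features of the language itself rather than about how to parse the text/code input.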

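Before you start implementing, it is also worth doing some more research. Specifically, I suggest reading up on abstract syntax trees (ASTs) and parsing algorithms, in particular recursive descent, which is what you would write if you build the parser by hand, and LALR(1) (look-ahead LR), which is what SLY generates.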
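To give a feel for the hand-written style, here is a minimal sketch with one function per grammar rule, which is the shape a recursive descent parser takes. It assumes a made-up grammar, `concat := operand ('@' operand)*` with `operand := STRING | NAME`, and a token list of `(type, text)` pairs; the grammar is too small to actually recurse, and the node name `CONCAT` is hypothetical:

```python
def parse_concat(tokens):
    """Parse  concat := operand ('@' operand)*  into a nested tuple tree."""
    pos = 0

    def peek():
        # Look at the current token without consuming it.
        return tokens[pos] if pos < len(tokens) else ("EOF", "")

    def parse_operand():
        # operand := STRING | NAME
        nonlocal pos
        kind, text = peek()
        if kind not in ("STRING", "NAME"):
            raise SyntaxError(f"expected a string or a name, got {kind}")
        pos += 1
        return (kind, text)

    node = parse_operand()
    while peek()[0] == "AT":
        pos += 1  # consume '@'
        node = ("CONCAT", node, parse_operand())  # left-associative
    if peek()[0] != "EOF":
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return node


tokens = [("STRING", '"Value of a is: "'), ("AT", "@"), ("NAME", "a")]
print(parse_concat(tokens))
# ('CONCAT', ('STRING', '"Value of a is: "'), ('NAME', 'a'))
```

The returned nested tuples are a very small stand-in for a syntax tree; a real project would likely use dedicated node classes.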

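An AST is the output of the parser (which the parser generator builds for you) and is what you use to interpret or compile your language. ASTs are fundamental to building a programming language, so that is where I would start. This video explains syntax trees, and there are many Python-specific videos on parsing as well. This series also walks through creating a simple language in Python with SLY.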

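Edit: regarding the specific parsing of the @ symbol before strings, I suggest using one token type for the @ symbol and another for string literals. In your parser, when it encounters an @ symbol, you can check whether the next token is a string literal. This reduces complexity by splitting up the regex, and it also lets you reuse the tokens if you later implement features that also use @ or string literals.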
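Applied to the examples in the question, that could look roughly like the sketch below. The function name `split_print_args` is hypothetical, and I am assuming that a bare name such as `a` is also allowed after '@', since your examples use one; string escapes are not handled:

```python
import re

# One alternative per token type. A string literal is matched as a single
# unit, so '@' or spaces inside it never reach the splitting logic.
TOKEN = re.compile(r'\s*(?:(?P<STRING>"[^"]*")|(?P<AT>@)|(?P<NAME>[^\s@"]+))')


def split_print_args(text):
    """Split text on '@' tokens that lie outside string literals."""
    text = text.rstrip()
    args = []
    pos = 0
    expect_operand = True  # at the start, and right after every '@'
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"invalid token at position {pos}")
        kind = match.lastgroup
        # '@' where an operand is expected, or an operand where '@' is expected
        if (kind == "AT") == expect_operand:
            raise SyntaxError(f"unexpected {match.group(kind)!r} at position {match.start(kind)}")
        if kind != "AT":
            args.append(match.group(kind))
        expect_operand = kind == "AT"
        pos = match.end()
    if expect_operand:
        raise SyntaxError("expected a string literal or a name after '@'")
    return args


print(split_print_args('"   Value    of  a  is:     "    @     a   @  "   String  "'))
print(split_print_args('"@ in a string inside a string" @ is_out_from_a_string'))
```

Two operands in a row without an '@' between them are reported as an error here rather than silently split; whether that is the behavior you want is a design choice for your language.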


python

string

parsing