Java and regex lexer

Question

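I'm trying to write some kind of lexer in Java, using regex, for a custom markdown-like "language" I'm making. This is my first time working with this stuff, so I'm a bit lost on a few things.
An example of a possible syntax in it would be: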
Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!
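I was able to capture a few things. For example, I'm using (?<hex><#\w+>) to capture the "hex" and (?<action>\[[^]]*]\([^]]*\)) to get the entire "action" block.
My problem is capturing all of it together, i.e. how to combine these patterns. For example, the lexer needs to output something like this: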

TEXT - Some
HEX - <#000000>
TEXT - *text*
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT - and **finally** some more
HEX - <#000>
TEXT - text!

I'll handle the bold and italics later.
Would love some suggestions on how to combine them!

Answer 1

You can achieve this with regex capturing groups, like this:

^(.*?) (?<hex1><#\w+>)(\*[^*]*\*) (?<action>\[[^]]*]\([^]]*\)) (.*?) (?<hex2><#\w+>)(.*)$
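As a rough sketch of how that pattern behaves in Java: it describes one whole line with a fixed layout, so it is used with matches() rather than in a find() loop. The class name WholeLineDemo and the helper fits are made up for illustration, and the ] inside the character classes is escaped here for readability, which should not change the meaning.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one from this answer.
class WholeLineDemo {
    static final Pattern WHOLE_LINE = Pattern.compile(
        "^(.*?) (?<hex1><#\\w+>)(\\*[^*]*\\*) "
        + "(?<action>\\[[^\\]]*\\]\\([^\\]]*\\)) (.*?) (?<hex2><#\\w+>)(.*)$");

    // True only when the entire input has the exact layout the pattern spells out.
    static boolean fits(String input) {
        return WHOLE_LINE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String sample = "Some <#000000>*text* [<#ffffff>Some more]"
            + "(action: Other <#gradient>text) and **finally** some more <#000>text!";
        Matcher m = WHOLE_LINE.matcher(sample);
        if (m.matches()) {
            System.out.println("HEX - " + m.group("hex1"));
            System.out.println("ACTION - " + m.group("action"));
            System.out.println("HEX - " + m.group("hex2"));
        }
        // A line with a different arrangement of tokens is rejected as a whole.
        System.out.println(fits("Just <#fff>text"));
    }
}
```

The trade-off is that every input has to contain the same tokens in the same order, so this fits a single known line better than a general lexer.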

Answer 2

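One option is to use an alternation that matches each of the separate parts, and for the text part use, for example, a character class [\w!* ]+

In Java, you can then check which named capture group took part in the match.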

(?<hex><#\w+>)|(?<action>\[[^]]*]\([^]]*\))|(?<text>[\w!* ]+)

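Explanation

(?<hex><#\w+>) Capture group hex, match <# followed by 1+ word characters and >
| Or
(?<action> Capture group action
- \[[^]]*]\([^]]*\) Match [...] followed by (...)
) Close group
| Or
(?<text>[\w!* ]+) Capture group text, match 1+ of the characters listed in the character class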
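To keep the pieces readable as more token kinds get added, the alternation can be assembled from one string per named group. This is only a sketch: the class name LexerPattern and the constant names are made up, and the ] inside the character classes is escaped for readability.

```java
import java.util.regex.Pattern;

// Hypothetical holder class for the combined token pattern.
class LexerPattern {
    // One named group per token kind.
    static final String HEX = "(?<hex><#\\w+>)";
    static final String ACTION = "(?<action>\\[[^\\]]*\\]\\([^\\]]*\\))";
    static final String TEXT = "(?<text>[\\w!* ]+)";

    // Alternatives are tried left to right at each position in the input.
    static final Pattern TOKEN = Pattern.compile(String.join("|", HEX, ACTION, TEXT));

    public static void main(String[] args) {
        // Prints the combined pattern so it can be compared with the one above.
        System.out.println(TOKEN.pattern());
    }
}
```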


Example code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Backslashes are doubled because the regex is written as a Java string literal.
String regex = "(?<hex><#\\w+>)|(?<action>\\[[^]]*]\\([^]]*\\))|(?<text>[\\w!* ]+)";
String string = "Some <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and **finally** some more <#000>text!";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);

// Each find() matches exactly one alternative, so only one named group is non-null.
while (matcher.find()) {
    if (matcher.group("hex") != null) {
        System.out.println("HEX - " + matcher.group("hex"));
    }
    if (matcher.group("text") != null) {
        System.out.println("TEXT - " + matcher.group("text"));
    }
    if (matcher.group("action") != null) {
        System.out.println("ACTION - " + matcher.group("action"));
    }
}

Output

TEXT - Some 
HEX - <#000000>
TEXT - *text* 
ACTION - [<#ffffff>Some more](action: Other <#gradient>text)
TEXT -  and **finally** some more 
HEX - <#000>
TEXT - text!
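One thing to watch with the text class [\w!* ]+ is that find() silently skips any character that none of the alternatives accept, such as a comma. A possible variation, sketched below under my own assumptions, matches only the hex and action tokens and treats whatever lies between them as TEXT. The class GapLexer and the method lex are hypothetical names, and the part inside the parentheses uses [^)]* instead of the [^]]* from the pattern above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation: only "special" tokens are matched by the regex.
class GapLexer {
    static final Pattern SPECIAL = Pattern.compile(
        "(?<hex><#\\w+>)|(?<action>\\[[^\\]]*\\]\\([^)]*\\))");

    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = SPECIAL.matcher(input);
        int last = 0; // index just past the previous special token
        while (m.find()) {
            // Anything between two special tokens becomes a TEXT token.
            if (m.start() > last) {
                tokens.add("TEXT - " + input.substring(last, m.start()));
            }
            String kind = m.group("hex") != null ? "HEX - " : "ACTION - ";
            tokens.add(kind + m.group());
            last = m.end();
        }
        // Trailing text after the last special token.
        if (last < input.length()) {
            tokens.add("TEXT - " + input.substring(last));
        }
        return tokens;
    }

    public static void main(String[] args) {
        lex("Some, <#000000>*text* [<#ffffff>Some more](action: Other <#gradient>text) and <#000>text!")
            .forEach(System.out::println);
    }
}
```

With this approach no input character is dropped, at the cost of TEXT no longer being validated against a whitelist of characters.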
