How to use Scanner.useDelimiter() to match two adjacent characters followed by a word?

I am trying to parse plain .txt files that have this general structure:

[[Title]]
CATEGORIES: text, text, text
some text etc...
[[Next Title]]
CATEGORIES: text, text, text
Next other text etc ...

In my code I use this pattern:

Scanner inputScanner = new Scanner(fileEntry);
inputScanner.useDelimiter("\\]\\]|\\[\\[");
while (inputScanner.hasNext()) {
   // Get title of wiki article and contents
   String wikiName = inputScanner.next();
   String wikiContents = inputScanner.next();
}

But it also splits on items like these:
"[some text [ some other text ] some more text ]"
"[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s"
"[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]"
"[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]"
"observed is not some nonphysical world of [[consciousness]], mind, or mental life "

I would like the scanner to split only when it sees

'[[' or ']] CATEGORIES'

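But I am not sure how to do that, since I am not very good with patterns or regular expressions. Can anyone suggest a pattern that might work? I have looked at other delimiter questions and the javadoc, but I am having trouble applying them to my problem. Thank you for your time and any help you can provide!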

To match the titles correctly, we can use a positive lookahead in the regex:
\[\[(?=.*]]\nCATEGORIES:)|]]\n(?=CATEGORIES:)

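Explanation:

  • The first alternative matches [[ only when the rest of that line contains ]] followed by a newline and the string CATEGORIES:. Because that condition sits inside a positive lookahead, only the [[ itself is consumed as the delimiter.
  • Similarly, the second alternative matches ]] plus the newline only when CATEGORIES: comes right after. The lookahead leaves CATEGORIES: in the contents token.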
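
To see that the lookahead is zero-width, the pattern can be run through java.util.regex directly. This is only an illustrative sketch: the class name LookaheadDemo, the helper findDelimiters, and the sample string are made up for this example. The pattern is the one given above, with backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // The delimiter regex from above, escaped for a Java string literal.
    static final Pattern DELIMITER =
            Pattern.compile("\\[\\[(?=.*]]\\nCATEGORIES:)|]]\\n(?=CATEGORIES:)");

    // Hypothetical helper: lists every delimiter match as "start -> text",
    // showing newlines as a visible \n.
    static List<String> findDelimiters(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = DELIMITER.matcher(input);
        while (m.find()) {
            found.add(m.start() + " -> " + m.group().replace("\n", "\\n"));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "[[Title]]\nCATEGORIES: a, b\nsee [[link]] here\n";
        for (String hit : findDelimiters(sample)) {
            System.out.println(hit);
        }
        // Prints:
        // 0 -> [[
        // 7 -> ]]\n
    }
}
```

Only the brackets around Title are reported. The [[link]] inside the body is skipped because no CATEGORIES: line follows it, and the text CATEGORIES: itself is never part of a match.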

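Updated snippet (the delimiter here also allows optional whitespace via \s*, since the first title line in the sample has a trailing space):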

String text = "[[title1]] \n" +
        "CATEGORIES: [some text [ some other text ] some more text ]\n" +
        "[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s\n" +
        "[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]\n" +
        "[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]\n" +
        "observed is not some nonphysical world of [[consciousness]], mind, or mental life\n" +
        "[[title2]]\n" +
        "CATEGORIES: [[some more text]]";


Scanner inputScanner = new Scanner(text);
inputScanner.useDelimiter("\\[\\[(?=.*]]\\s*CATEGORIES:)|]]\\s*\\n(?=\\s*CATEGORIES:)");
while (inputScanner.hasNext()) {
    String wikiName = inputScanner.next();
    String wikiContents = inputScanner.next();
    System.out.printf("Name:%s\nContents:%s\n\n", wikiName, wikiContents);
}

Output:

Name:title1
Contents:CATEGORIES: [some text [ some other text ] some more text ]
[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s
[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]
[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]
observed is not some nonphysical world of [[consciousness]], mind, or mental life


Name:title2
Contents:CATEGORIES: [[some more text]]
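
One thing to watch with the loop above: it calls next() twice for every hasNext() check, so any stray text before the first [[Title]] shifts the name/contents pairing and can end in a NoSuchElementException. A possible guard is to read one token at a time and treat a token as contents only if it starts with CATEGORIES:, which the lookahead ensures for every contents token. This is a minimal sketch, not a definitive implementation: WikiSplitter, parse, and the sample input are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

public class WikiSplitter {
    // Same delimiter as in the snippet above, escaped for a Java string literal.
    static final String DELIMITER =
            "\\[\\[(?=.*]]\\s*CATEGORIES:)|]]\\s*\\n(?=\\s*CATEGORIES:)";

    // Hypothetical helper: maps each title to its contents, in file order.
    static Map<String, String> parse(String text) {
        Map<String, String> articles = new LinkedHashMap<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            String pendingTitle = null;
            while (scanner.hasNext()) {
                String token = scanner.next().trim();
                if (pendingTitle != null && token.startsWith("CATEGORIES:")) {
                    // A contents token: pair it with the title read just before it.
                    articles.put(pendingTitle, token);
                    pendingTitle = null;
                } else {
                    // A title, or stray text that the next real title replaces.
                    pendingTitle = token;
                }
            }
        }
        return articles;
    }

    public static void main(String[] args) {
        // Assumed sample input with a stray line before the first title.
        String text = "intro line\n"
                + "[[title1]]\nCATEGORIES: a, b\nsome [[link]] text\n"
                + "[[title2]]\nCATEGORIES: c\n";
        parse(text).forEach((title, contents) ->
                System.out.printf("Name:%s%nContents:%s%n%n", title, contents));
    }
}
```

With this approach the stray "intro line" is dropped instead of being paired with title1, and a title without contents at the end of the input is skipped without an exception.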