How to use Scanner.useDelimiter() to match two adjacent characters followed by a word?
I am trying to parse plain .txt files that have this general structure:
[[Title]]
CATEGORIES: text, text, text
some text etc...
[[Next Title]]
CATEGORIES: text, text, text
Next other text etc ...
In my code I use this pattern:
Scanner inputScanner = new Scanner(fileEntry);
// Split on every "]]" or "[[" (backslashes are doubled inside a Java string literal)
inputScanner.useDelimiter("\\]\\]|\\[\\[");
while (inputScanner.hasNext()) {
    // Get title of wiki article and contents
    String wikiName = inputScanner.next();
    String wikiContents = inputScanner.next();
}
But it also splits on items such as:
"[some text [ some other text ] some more text ]"
"[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s"
"[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]"
"[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]"
"observed is not some nonphysical world of [[consciousness]], mind, or mental life "
I want the scanner to delimit only when it sees
'[[' or ']] CATEGORIES'
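but I am not sure how to do that, since I am not very good with patterns or regular expressions.
Can anyone suggest a pattern that might work? I have looked at other delimiter questions and at the javadoc, but I had a hard time applying them to my problem.
Thank you for your time and for any help you can offer!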
To match the titles correctly, we can use a positive lookahead in the regular expression:
\[\[(?=.*]]\nCATEGORIES:)|]]\n(?=CATEGORIES:)
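Explanation:
- The first alternative matches [[ only when it is followed, on the same line, by any sequence of characters, then ]], a newline, and the string CATEGORIES:. Because the condition is a positive lookahead, only the [[ itself is matched and consumed.
- Similarly, the second alternative matches ]] plus the newline, but only when the string CATEGORIES: comes right after it.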
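To see what the lookahead does and does not consume, a small check along these lines may help. It is only a sketch: the class name LookaheadDemo, the helper delimiterMatches, and the sample string are made up for illustration. It lists every place where the delimiter matches, together with the text that was actually consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // The delimiter from above, written as a Java string literal (backslashes doubled).
    static final Pattern DELIMITER =
            Pattern.compile("\\[\\[(?=.*]]\\nCATEGORIES:)|]]\\n(?=CATEGORIES:)");

    // Hypothetical helper: returns each delimiter match as "startIndex:consumedText".
    // Newlines are shown as the two characters \n so they stay visible when printed.
    static List<String> delimiterMatches(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = DELIMITER.matcher(input);
        while (m.find()) {
            matches.add(m.start() + ":" + m.group().replace("\n", "\\n"));
        }
        return matches;
    }

    public static void main(String[] args) {
        // Illustrative input: one title line, one CATEGORIES line, one inline [[link]].
        String sample = "[[Title]]\nCATEGORIES: a, b\nsee [[link]] here\n";
        System.out.println(delimiterMatches(sample));
    }
}
```

Only the [[ in front of the title and the ]] plus newline after it are reported. The inline [[link]] is skipped because no CATEGORIES: line follows it, and the title text itself is never part of a match, so the scanner can still return it as a token.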
Updated code snippet:
String text = "[[title1]] \n" +
"CATEGORIES: [some text [ some other text ] some more text ]\n" +
"[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s\n" +
"[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]\n" +
"[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]\n" +
"observed is not some nonphysical world of [[consciousness]], mind, or mental life\n" +
"[[title2]]\n" +
"CATEGORIES: [[some more text]]";
Scanner inputScanner = new Scanner(text);
// \\s* tolerates stray whitespace around the line break, as in "[[title1]] \n"
inputScanner.useDelimiter("\\[\\[(?=.*]]\\s*CATEGORIES:)|]]\\s*\\n(?=\\s*CATEGORIES:)");
while (inputScanner.hasNext()) {
    String wikiName = inputScanner.next();
    String wikiContents = inputScanner.next();
    System.out.printf("Name:%s\nContents:%s\n\n", wikiName, wikiContents);
}
Output:
Name:title1
Contents:CATEGORIES: [some text [ some other text ] some more text ]
[[Vertebrate trachea|trachea]]s from human stem cells. Several [[artificial urinary bladder]]s
[[Image:Bohr-atom-PAR.svg|thumb|right|310px|The Rutherford–Bohr model of the hydrogen atom ([tpl]nowrap|Z [tpl]=[/tpl] 1[/tpl]) or a hydrogen-like ion ([tpl]nowrap|Z > 1[/tpl]), results in a photon of wavelength 656 nm (red light).]]
[[File:Gettysburg Campaign.png|thumb|350px|Gettysburg Campaign (through July 3); cavalry movements shown with dashed lines. [tpl]legend|#ff0000|Confederate[/tpl]]]
observed is not some nonphysical world of [[consciousness]], mind, or mental life
Name:title2
Contents:CATEGORIES: [[some more text]]
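If the pairs are needed later rather than just printed, one possible way to package the same delimiter is sketched below. The class name WikiSplitter and the method parse are hypothetical, and trimming the tokens is an assumption about what is wanted. The second next() is guarded with hasNext(), since calling it on an input that yields an odd number of tokens would throw NoSuchElementException.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

public class WikiSplitter {
    // Same delimiter as in the snippet above.
    static final String DELIMITER =
            "\\[\\[(?=.*]]\\s*CATEGORIES:)|]]\\s*\\n(?=\\s*CATEGORIES:)";

    // Hypothetical helper: collects title -> contents pairs in the order they appear.
    static Map<String, String> parse(String text) {
        Map<String, String> articles = new LinkedHashMap<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                String title = scanner.next().trim();
                // Guard the second read so an unpaired trailing token does not throw.
                String contents = scanner.hasNext() ? scanner.next().trim() : "";
                articles.put(title, contents);
            }
        }
        return articles;
    }

    public static void main(String[] args) {
        // Illustrative input: two articles, the first with an inline [[link]].
        String text = "[[A]]\nCATEGORIES: x, y\nbody with [[link]]\n[[B]]\nCATEGORIES: z\n";
        parse(text).forEach((title, contents) ->
                System.out.println(title + " -> " + contents.replace("\n", " | ")));
    }
}
```

A LinkedHashMap keeps the articles in file order. If two articles can share a title, a list of pairs would be a safer container, because put() overwrites the earlier entry.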