Custom Java Regex: Match starting with and ending with
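I have been struggling with this problem for a few days, and I am hoping someone can help me solve it.
What I want to do is process a text file that contains a set of questions and answers. The content of the file (.doc or .docx) looks like this: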
Document Name
1. Question one:
a. Answer one to question one
b. Answer two to question one
c. Answer three to question one
2. Question two:
a. Answer one to question two
c. Answer two to question two
e. Answer three to question two
What I have tried so far:
Reading the content of the document through Apache POI, like this:
fis = new FileInputStream(new File(FilePath));
XWPFDocument doc = new XWPFDocument(fis);
XWPFWordExtractor extract = new XWPFWordExtractor(doc);
String extractorText = extract.getText();
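So at this point I have the content of the document. Next, I tried to create a regex pattern that matches the number and dot at the start of a question (1., 12.) and continues until it reaches a colon, like this:
Pattern regexPattern = Pattern.compile("^(\\d|\\d\\d)+\\.[^:]+:\\s*$", Pattern.MULTILINE);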
Matcher regexMatcher = regexPattern.matcher(extractorText);
However, when I iterate over the matches, I never get any question text:
while (regexMatcher.find()) {
    System.out.println("Found");
    for (int i = 0; i < regexMatcher.groupCount() - 2; i += 2) {
        map.put(regexMatcher.group(i + 1), regexMatcher.group(i + 2));
        System.out.println("#" + regexMatcher.group(i + 1) + " >> " + regexMatcher.group(i + 2));
    }
}
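Since I am new to Java, I am not sure where I went wrong, and I hope someone can help me.
Also, if anyone has a better way to build a map of the questions and their related answers, I would appreciate it.
Thanks in advance.
EDIT: I am trying to get something like a Map whose keys are the question texts and whose values are lists of strings holding the answers to each question, for example: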
Map<String, List<String>> desiredResult = new HashMap<>();
desiredResult.entrySet().forEach((entry) -> {
    String questionText = entry.getKey();
    List<String> answersList = entry.getValue();
    System.out.println("Now at question: " + questionText);
    answersList.forEach((answerText) -> {
        System.out.println("Now at answer: " + answerText);
    });
});
This would produce the following output:
Now at question: 1. Question one:
Now at answer: a. Answer one to question one
Now at answer: b. Answer two to question one
Now at answer: c. Answer three to question one
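After some thought, I came up with an answer. Splitting the document on line breaks gives an array containing all the lines.
While iterating over that array, we only need to decide whether a line is a question or an answer. I did that with two different regular expressions:
Questions: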
\d{1,2}\..+
Answers:
[a-z]\..+
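With these we can decide whether a new question has started or whether the line has to be added to the answers of the current question.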
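To see how the two expressions classify individual lines, here is a minimal sketch. The class name LineClassifier and the helper classify are made up for illustration; the patterns are the two shown above, with the backslashes doubled as a Java string literal requires.

```java
import java.util.regex.Pattern;

public class LineClassifier {
    // Same expressions as above, escaped for a Java string literal.
    static final Pattern QUESTION = Pattern.compile("\\d{1,2}\\..+");
    static final Pattern ANSWER = Pattern.compile("[a-z]\\..+");

    // Hypothetical helper: labels one line as "question", "answer" or "other".
    // matches() requires the whole line to fit the pattern.
    static String classify(String line) {
        if (QUESTION.matcher(line).matches()) {
            return "question";
        }
        if (ANSWER.matcher(line).matches()) {
            return "answer";
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify("1. Question one:"));
        System.out.println(classify("a. Answer one to question one"));
        System.out.println(classify("Document Name"));
    }
}
```

Because matches() anchors the pattern to the entire line, a heading such as "Document Name" falls through to "other" instead of being picked up as an answer.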
The full code is as follows:
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// the document text, as read from the file
String document = "Document Name\n" +
        "1. Question one:\n" +
        "a. Answer one to question one\n" +
        "b. Answer two to question one\n" +
        "c. Answer three to question one\n" +
        "2. Question two:\n" +
        "a. Answer one to question two\n" +
        "c. Answer two to question two\n" +
        "e. Answer three to question two";
// split into lines (handles both \n and \r\n line endings)
String[] lines = document.split("\\r?\\n");
// the regex patterns; backslashes are doubled inside Java string literals
Pattern questionPattern = Pattern.compile("\\d{1,2}\\..+");
Pattern answerPattern = Pattern.compile("[a-z]\\..+");
// the most recently seen question, used as the key for following answers
String lastLine = null;
// the result; LinkedHashMap keeps the questions in document order
Map<String, List<String>> result = new LinkedHashMap<>();
for (int lineNumber = 0; lineNumber < lines.length; lineNumber++) {
    String line = lines[lineNumber];
    if (questionPattern.matcher(line).matches()) {
        // a new question starts: give it an empty answer list
        result.put(line, new ArrayList<>());
        lastLine = line;
    } else if (lastLine != null && answerPattern.matcher(line).matches()) {
        // an answer belongs to the last question seen
        result.get(lastLine).add(line);
    } else {
        System.out.printf("Line %d is neither a question nor an answer!%n", lineNumber);
    }
}
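As for the loop in the question: one likely reason it prints "Found" but never a question is that the pattern contains a single capturing group, so groupCount() returns 1 and the condition i < groupCount() - 2 is false from the first iteration. The matched text itself is available through group() with no argument. A minimal sketch of that, assuming the extracted text looks like the sample above; the names QuestionFinder and findQuestions are made up for illustration, and the pattern is a slightly tightened variant of the one in the question, not the original.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuestionFinder {
    // One or two digits, a dot, text up to a colon, all on one line.
    // [^:\r\n] keeps the match from running across line breaks.
    static final Pattern QUESTION_LINE =
            Pattern.compile("^\\d{1,2}\\.[^:\\r\\n]+:\\s*$", Pattern.MULTILINE);

    // Hypothetical helper: returns every question line found in the text.
    static List<String> findQuestions(String text) {
        List<String> questions = new ArrayList<>();
        Matcher matcher = QUESTION_LINE.matcher(text);
        while (matcher.find()) {
            // group() is the whole match; trim() drops trailing whitespace
            questions.add(matcher.group().trim());
        }
        return questions;
    }

    public static void main(String[] args) {
        String text = "Document Name\n"
                + "1. Question one:\n"
                + "a. Answer one to question one\n"
                + "2. Question two:\n"
                + "a. Answer one to question two";
        for (String question : findQuestions(text)) {
            System.out.println(question);
        }
    }
}
```

This only collects the questions. Pairing each question with its answers still needs a pass over the lines in between, which is why the line-by-line approach above is the simpler fit here.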