How to read multi-line content between two words from a PDF file using Java?
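I have a requirement to extract the data that appears after the word "IN:" and before the word "OUT:" in a PDF file, and there are many such occurrences throughout the file.
The problem is that the data can span multiple lines, and its format is not defined.
I even tried adding conditions, such as requiring a line to start or end with a particular character, but that way I had to write far too many conditions, and text in the same format also appears after the word "OUT:".
Please let me know how I can solve this.
Below are sample data formats:
Format 1: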
IN: {
"abc": "valueabc",
"def": "valuedef",
"ghi":
[
{"jkl": valuejkl, "mno": valuemno, "pqr":
"valuepqr"},
{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":
"valuepqr"},
{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":
"valuepqr"}
],
"id": "1"
}
OUT: {"abc": "valueabc", "id": "1", "def": {}}
Format 2:
IN: {"abc": "valueabc", "def": "valuedef", "id": "1"}
OUT: {"abc": "valueabc", "id": "1", "ghi": "valueghi"}
Format 3:
IN: {"abc": "valueabc", "def": "valuedef", "jkl":
["valuejkl"], "id": "1"}
OUT: {"abc": "valueabc", "id": "1", "ghi": {}}
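Below is the core logic of the solution I tried. The first if branch fetches a separate piece of data; the branches after it hold the logic for fetching the data after "IN:" and before "OUT:".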
// Requires: import java.util.regex.Pattern;
for (String line : lines)
{
    // Lines that begin with a dotted number such as "1.2.3." carry the separate data.
    String pattern = "^[0-9]+[\\.][0-9]+[\\.][0-9]+[\\.].*";
    boolean matches = Pattern.matches(pattern, line);
    if (matches)
    {
        // The dot must be escaped because split() takes a regular expression.
        String subString1 = line.split("\\.")[3].trim();
        String subString2 = line.split("\\.")[4].trim();
        String finalString = subString1 + "." + subString2 + ",";
        System.out.println();
        System.out.print(finalString);
    }
    else if (line.startsWith("IN:"))
    {
        // First line of an IN block: drop the "IN:" marker.
        String finalString = line.substring(3).trim();
        System.out.print(finalString);
    }
    else if (!(line.startsWith("IN:") || line.startsWith("OUT:")) && ((line.trim().length() > 1) && (line.endsWith("}"))))
    {
        // Continuation line that ends with a closing brace.
        String finalString = line.trim();
        System.out.print(finalString);
    }
    else if (!(line.startsWith("IN:") || line.startsWith("OUT:")) && ((line.trim().length() > 1) && (line.startsWith("\""))))
    {
        // Continuation line that starts with a quoted key or value.
        String finalString = line.trim();
        System.out.print(finalString);
    }
    else
    {
        continue;
    }
}
How about this? If you want the value between IN: and OUT:, could you try this code?
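StringBuilder sb = new StringBuilder();
boolean insideIn = false; // true while we are between "IN:" and the next "OUT:"
for (String line : lines) {
    if (line.startsWith("IN:")) {
        // Start of a new block: discard stale content and drop the marker.
        sb.setLength(0);
        sb.append(line.substring(3).trim());
        insideIn = true;
    } else if (line.startsWith("OUT:")) {
        if (insideIn) {
            // End of the block: print everything collected since "IN:".
            System.out.println(sb.toString());
            sb.setLength(0);
        }
        insideIn = false;
    } else if (insideIn) {
        // Continuation line of the current IN block.
        sb.append(line.trim());
    }
}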
Input text:
IN: {
"abc": "valueabc",
"def": "valuedef",
"ghi":
[
"valuepqr"},
{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":
"valuepqr"}
],
"id": "1"
}
OUT: {"abc": "valueabc", "~"}
Result:
{"abc": "valueabc","def": "valuedef","ghi":["valuepqr"},{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":"valuepqr"}],"id": "1"}
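If the whole PDF text is already available as a single String rather than as a list of lines, the same idea could also be expressed with a regular expression. The following is only a minimal sketch under that assumption: the class name InOutExtractor and the method extractInBlocks are hypothetical names introduced for illustration, and it assumes that "IN:" and "OUT:" always appear at the start of a line and never inside the data itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
final class InOutExtractor {

    // "IN:" at the start of a line, then the shortest run of text up to the
    // next line that starts with "OUT:". DOTALL lets '.' span line breaks;
    // MULTILINE makes '^' match at the start of every line.
    private static final Pattern IN_BLOCK =
            Pattern.compile("^IN:(.*?)^OUT:", Pattern.DOTALL | Pattern.MULTILINE);

    // Returns one entry per IN block, with each physical line trimmed and joined.
    static List<String> extractInBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = IN_BLOCK.matcher(text);
        while (m.find()) {
            StringBuilder sb = new StringBuilder();
            // Trim each line and concatenate, as the line-based loop does.
            for (String line : m.group(1).split("\\R")) {
                sb.append(line.trim());
            }
            blocks.add(sb.toString());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Assumed sample text, modelled on "Format 3" from the question.
        String text = String.join("\n",
                "IN: {\"abc\": \"valueabc\", \"jkl\":",
                "[\"valuejkl\"], \"id\": \"1\"}",
                "OUT: {\"abc\": \"valueabc\", \"id\": \"1\", \"ghi\": {}}",
                "IN: {\"id\": \"2\"}",
                "OUT: {\"id\": \"2\"}");
        for (String block : extractInBlocks(text)) {
            System.out.println(block);
        }
    }
}
```

Because the match is reluctant and anchored on line starts, each IN block stops at the first following line that begins with "OUT:", so the content after "OUT:" is never collected. If the data itself can contain a line starting with "IN:" or "OUT:", this sketch would cut the block short and a different delimiter rule would be needed.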