Parsing objects out of a PDF: why are objects with byte streams being skipped?
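My current assignment involves pulling every object out of a PDF file and then working with the parsed objects. However, I have noticed that my code skips some of the stream objects.
I'm confused and hope someone can help point out what is going wrong here.
Here is the main parsing code: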
void parseRawPDFFile() {
    //Transform the bytes obtained from the file into a byte character sequence. This byte character sequence
    //object is what allows us to use it in regex.
    ByteCharSequence byteCharSequence = new ByteCharSequence(bytesFromFile.toByteArray());
    byteCharSequence.getStringFromData();
    Pattern pattern = Pattern.compile(SINGLE_OBJECT_REGEX);
    Matcher matcher = pattern.matcher(byteCharSequence);
    //While we have a match (apparently only one match exists at a time) keep looping over the list.
    //When a match is found, get the starting and ending indices and manually cut these out char by char
    //and assemble them into a new "ByteArrayOutputStream".
    int counterOfDoom = 1;
    while (matcher.find()) {
        for (int i = 0; i < matcher.groupCount(); i++) {
            ByteArrayOutputStream cutOutArray = cutOutByteArrayOutputStreamFromOriginal(matcher.start(), matcher.end());
            System.out.println("----------------------------------------------------");
            System.out.println(cutOutArray);
            //At this point we have cut out the object and can now send it for processing.
            createPDFObject(cutOutArray);
            System.out.println(counterOfDoom);
            System.out.println("----------------------------------------------------");
            counterOfDoom++;
        }
    }
}
Here is the code for ByteCharSequence
(the core of this code comes from: http://blog.sarah-happy.ca/2013/01/java-regular-expression-on-byte-array.html)
public class ByteCharSequence implements CharSequence {
    private final byte[] data;
    private final int length;
    private final int offset;

    public ByteCharSequence(byte[] data) {
        this(data, 0, data.length);
    }

    public ByteCharSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return this.length;
    }

    @Override
    public char charAt(int index) {
        return (char) (data[offset + index] & 0xff);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new ByteCharSequence(data, offset + start, end - start);
    }

    /**
     * Get the string from the ByteCharSequence data.
     * @return
     */
    public String getStringFromData() {
        //Load it into the method I know works to convert it to a string... Optimized? Probably not at all.
        //But it works...
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        for (byte individualByte : data) {
            byteArrayOutputStream.write(individualByte);
        }
        return byteArrayOutputStream.toString();
    }
}
The PDF data I am currently working with:
10 0 obj
<</Filter/FlateDecode/Length 1040>>stream
(Bunch of bytes)
endstream
endobj
12 0 obj
<</Filter/FlateDecode/Length 2574/N 3>>stream
(Bunch of bytes)
endstream
endobj
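Some things I have tried to look into:
1: As far as I understand, there should be no limit on how much the data structure can hold, so size should not be the problem?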
Add the Pattern.DOTALL flag to your Pattern.compile call so that the dot (.) in your pattern also matches newline characters =)
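The effect of the flag can be seen in a small, self-contained sketch. The question does not show SINGLE_OBJECT_REGEX, so the OBJECT_REGEX below is only a hypothetical stand-in, and DotallDemo and findObjects are names made up for this illustration. The sample text is decoded as ISO-8859-1 so that each byte maps to exactly one char, which is the same mapping that ByteCharSequence.charAt produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Hypothetical stand-in for SINGLE_OBJECT_REGEX (not shown in the question).
    static final String OBJECT_REGEX = "\\d+ \\d+ obj.*?endobj";

    // Collects every match of OBJECT_REGEX in the input, compiled with the given flags.
    static List<String> findObjects(CharSequence input, int flags) {
        Pattern pattern = Pattern.compile(OBJECT_REGEX, flags);
        Matcher matcher = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Two objects: the first spans several lines, the second sits on one line.
        byte[] raw = ("10 0 obj\n<</Length 4>>stream\nab\ncd\nendstream\nendobj\n"
                + "12 0 obj <</N 3>> endobj\n").getBytes(StandardCharsets.ISO_8859_1);
        String text = new String(raw, StandardCharsets.ISO_8859_1);

        // Without DOTALL the dot stops at '\n', so the multi-line object is skipped.
        System.out.println(findObjects(text, 0).size());              // prints 1
        // With DOTALL the dot also matches '\n', so both objects are found.
        System.out.println(findObjects(text, Pattern.DOTALL).size()); // prints 2
    }
}
```

Compressed stream data is arbitrary binary, so it will usually contain bytes that look like line terminators, which is why stream objects are the ones that get skipped without the flag. Note that a regex like this one can still stop early if the stream bytes happen to contain the text endobj; reading the stream by its /Length value is the more robust approach.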