How to find the location of an image or picture while parsing an MS Word .doc in Java using Apache POI
HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
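The statements above give me the list of all pictures in the document. I want to know after which text or position in the document each image is located.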
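You're getting at the pictures the wrong way, which is why you're not finding any positions!
What you need to do is process each CharacterRun of the document in turn. Pass it to the PicturesTable and check whether the character run has a picture in it. If it does, fetch the picture back from the table; you know where in the document it sits, because you have the run it came from.
At its simplest, it would be something like: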
PicturesSource pictures = new PicturesSource(document);
PicturesTable pictureTable = document.getPicturesTable();
Range r = document.getRange();
for (int i = 0; i < r.numParagraphs(); i++) {
    Paragraph p = r.getParagraph(i);
    for (int j = 0; j < p.numCharacterRuns(); j++) {
        CharacterRun cr = p.getCharacterRun(j);
        if (pictureTable.hasPicture(cr)) {
            Picture picture = pictures.getFor(cr);
            // Do something useful with the picture
        }
    }
}
You can find a good example of doing this in the Apache Tika parser for Microsoft Word .doc, which is powered by Apache POI.
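To make the "you know the position because you have the run" idea concrete, here is a minimal, self-contained sketch that does not depend on POI. `Run`, `Placement`, `locate` and `sampleRuns` are hypothetical stand-ins invented for this illustration, not POI types. With the real library, the run text would presumably come from `CharacterRun.text()` and the picture check from `PicturesTable.hasPicture(cr)`. The sketch also assumes that a picture run carries a one-character placeholder (`\u0001`) as its text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PicturePositionSketch {

    /** Hypothetical stand-in for a character run: its text plus an optional picture. */
    static final class Run {
        final String text;
        final String pictureName; // null when the run carries no picture

        Run(String text, String pictureName) {
            this.text = text;
            this.pictureName = pictureName;
        }
    }

    /** Where a picture was found: its character offset and the text before it. */
    static final class Placement {
        final String pictureName;
        final int charOffset;
        final String precedingText;

        Placement(String pictureName, int charOffset, String precedingText) {
            this.pictureName = pictureName;
            this.charOffset = charOffset;
            this.precedingText = precedingText;
        }
    }

    /** Walks the runs in document order and records where each picture sits. */
    static List<Placement> locate(List<Run> runs) {
        List<Placement> placements = new ArrayList<Placement>();
        StringBuilder visibleText = new StringBuilder();
        int offset = 0; // counts every character, including picture placeholders
        for (Run run : runs) {
            if (run.pictureName != null) {
                placements.add(new Placement(run.pictureName, offset, visibleText.toString()));
            } else {
                visibleText.append(run.text);
            }
            offset += run.text.length();
        }
        return placements;
    }

    /** Made-up sample data: two text runs, each followed by a picture run. */
    static List<Run> sampleRuns() {
        return Arrays.asList(
                new Run("Intro text. ", null),
                new Run("\u0001", "logo.png"),
                new Run("More text. ", null),
                new Run("\u0001", "chart.jpg"));
    }

    public static void main(String[] args) {
        for (Placement p : locate(sampleRuns())) {
            System.out.println(p.pictureName + " at offset " + p.charOffset
                    + ", after: \"" + p.precedingText.trim() + "\"");
        }
    }
}
```

With the sample data, the first picture is recorded at character offset 12 and the second at offset 24. The same bookkeeping can be added inside the paragraph and run loop shown earlier.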
You should add the PicturesSource class:
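import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.PicturesTable;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.Range;

public class PicturesSource {
    private PicturesTable picturesTable;
    private Set<Picture> output = new HashSet<Picture>();
    // Pictures keyed by their start offset, so a run's picture offset can find them
    private Map<Integer, Picture> lookup;
    // Pictures not claimed by any character run; claimed slots are set to null
    private List<Picture> nonU1based;
    private List<Picture> all;
    private int pn = 0;

    public PicturesSource(HWPFDocument doc) {
        picturesTable = doc.getPicturesTable();
        all = picturesTable.getAllPictures();

        lookup = new HashMap<Integer, Picture>();
        for (Picture p : all) {
            lookup.put(p.getStartOffset(), p);
        }

        nonU1based = new ArrayList<Picture>();
        nonU1based.addAll(all);
        Range r = doc.getRange();
        for (int i = 0; i < r.numCharacterRuns(); i++) {
            CharacterRun cr = r.getCharacterRun(i);
            if (picturesTable.hasPicture(cr)) {
                Picture p = getFor(cr);
                int at = nonU1based.indexOf(p);
                // Guard against a run whose picture is not in the list
                if (at >= 0) {
                    nonU1based.set(at, null);
                }
            }
        }
    }

    private boolean hasPicture(CharacterRun cr) {
        return picturesTable.hasPicture(cr);
    }

    private void recordOutput(Picture picture) {
        output.add(picture);
    }

    private boolean hasOutput(Picture picture) {
        return output.contains(picture);
    }

    private int pictureNumber(Picture picture) {
        return all.indexOf(picture) + 1;
    }

    // Returns the picture referenced by this run, or null if none matches
    public Picture getFor(CharacterRun cr) {
        return lookup.get(cr.getPicOffset());
    }

    // Returns the next picture that no character run referenced, or null when done
    private Picture nextUnclaimed() {
        Picture p = null;
        while (pn < nonU1based.size()) {
            p = nonU1based.get(pn);
            pn++;
            if (p != null) return p;
        }
        return null;
    }
}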