How to know the image or picture location while parsing an MS Word .doc in Java using Apache POI

Question

HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();

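The statements above give a list of all the pictures in the document. How can I find out after which text or position in the document each image is located?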

Answer 1

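You're getting at the pictures the wrong way, which is why you can't find any positions!

What you need to do is process each CharacterRun of the document in turn. Pass it to the PicturesTable and check whether the character run has a picture in it. If it does, fetch the picture from the table; you then know where in the document it belongs, because you have the run it came from.

In the simplest case, that looks like: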

// document is the HWPFDocument being parsed.
// PicturesSource is not a POI class; it is a small helper (see Answer 2).
PicturesSource pictures = new PicturesSource(document);
PicturesTable pictureTable = document.getPicturesTable();

Range r = document.getRange();
for (int i = 0; i < r.numParagraphs(); i++) {
    Paragraph p = r.getParagraph(i);
    for (int j = 0; j < p.numCharacterRuns(); j++) {
        CharacterRun cr = p.getCharacterRun(j);
        if (pictureTable.hasPicture(cr)) {
            // cr is the run that anchors this picture, so it gives the position
            Picture picture = pictures.getFor(cr);
            // Do something useful with the picture
        }
    }
}

You can find a good example of doing this in the Apache Tika parser for Microsoft Word .doc files, which is powered by Apache POI.
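Because the runs are visited in document order, the text accumulated before a picture run is the text the picture comes after. The following is only a sketch of that bookkeeping: `Run`, `Position`, and `PicturePositions` are hypothetical stand-ins so the example runs without POI on the classpath. In real code the text would come from `cr.text()` and the picture check from `pictureTable.hasPicture(cr)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PicturePositions {

    /** Hypothetical stand-in for a POI CharacterRun: its text and whether it anchors a picture. */
    static final class Run {
        final String text;
        final boolean hasPicture;

        Run(String text, boolean hasPicture) {
            this.text = text;
            this.hasPicture = hasPicture;
        }
    }

    /** Where a picture was found: the amount of text before it and the tail of that text. */
    static final class Position {
        final int charOffset;
        final String precedingText;

        Position(int charOffset, String precedingText) {
            this.charOffset = charOffset;
            this.precedingText = precedingText;
        }
    }

    /**
     * Walks the runs in order. Text of ordinary runs is accumulated; for each picture
     * run, the length of the text so far and its last contextChars characters are recorded.
     * Picture runs themselves contribute no text, so offsets count visible text only.
     */
    static List<Position> locate(List<Run> runs, int contextChars) {
        List<Position> found = new ArrayList<Position>();
        StringBuilder textSoFar = new StringBuilder();
        for (Run run : runs) {
            if (run.hasPicture) {
                int from = Math.max(0, textSoFar.length() - contextChars);
                found.add(new Position(textSoFar.length(), textSoFar.substring(from)));
            } else {
                textSoFar.append(run.text);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Figure one: ", false),
                new Run("\u0001", true),   // placeholder run that anchors a picture
                new Run(" and then ", false),
                new Run("\u0001", true));
        for (Position p : locate(runs, 6)) {
            System.out.println(p.charOffset + " after \"" + p.precedingText + "\"");
        }
    }
}
```

Keeping only a short tail of the preceding text is a design choice for readability; storing the offset alone is enough if you only need the position.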

Answer 2

You need to add the PicturesSource class:

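import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.PicturesTable;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.Range;

public class PicturesSource {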

private PicturesTable picturesTable;
private Set<Picture> output = new HashSet<Picture>();
private Map<Integer, Picture> lookup;
private List<Picture> nonU1based;
private List<Picture> all;
private int pn = 0;

public PicturesSource(HWPFDocument doc) {
    picturesTable = doc.getPicturesTable();
    all = picturesTable.getAllPictures();


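    // Index every picture by its start offset so a run's picture offset can find it
    lookup = new HashMap<Integer, Picture>();
    for (Picture p : all) {
        lookup.put(p.getStartOffset(), p);
    }

    // Start from all pictures, then null out those referenced by a character run;
    // what remains are pictures not anchored in the main text (for example floating ones)
    nonU1based = new ArrayList<Picture>();
    nonU1based.addAll(all);
    Range r = doc.getRange();
    for (int i = 0; i < r.numCharacterRuns(); i++) {
        CharacterRun cr = r.getCharacterRun(i);
        if (picturesTable.hasPicture(cr)) {
            Picture p = getFor(cr);
            int at = nonU1based.indexOf(p);
            // indexOf returns -1 when the picture is no longer in the list
            // (for example, already claimed by an earlier run)
            if (at >= 0) {
                nonU1based.set(at, null);
            }
        }
    }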
}


private boolean hasPicture(CharacterRun cr) {
    return picturesTable.hasPicture(cr);
}

private void recordOutput(Picture picture) {
    output.add(picture);
}

private boolean hasOutput(Picture picture) {
    return output.contains(picture);
}

private int pictureNumber(Picture picture) {
    return all.indexOf(picture) + 1;
}

public Picture getFor(CharacterRun cr) {
    return lookup.get(cr.getPicOffset());
}


// Returns the next picture that no character run referenced, or null when none remain
private Picture nextUnclaimed() {
    Picture p = null;
    while (pn < nonU1based.size()) {
        p = nonU1based.get(pn);
        pn++;
        if (p != null) return p;
    }
    return null;
}

}
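The core of this class is a join on offsets: each picture is indexed by its start offset, and a run is matched to a picture through the offset it carries. The sketch below illustrates only that idea with plain Java collections; `Pic` and `OffsetLookupSketch` are hypothetical stand-ins and do not use POI, so the offsets are made-up numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OffsetLookupSketch {

    /** Hypothetical stand-in for a POI Picture: a name plus its start offset. */
    static final class Pic {
        final String name;
        final int startOffset;

        Pic(String name, int startOffset) {
            this.name = name;
            this.startOffset = startOffset;
        }
    }

    /** Indexes pictures by start offset, mirroring the lookup map built in the constructor. */
    static Map<Integer, Pic> index(List<Pic> all) {
        Map<Integer, Pic> lookup = new LinkedHashMap<Integer, Pic>();
        for (Pic p : all) {
            lookup.put(p.startOffset, p);
        }
        return lookup;
    }

    /** Returns the pictures that none of the given run offsets point at, in original order. */
    static List<Pic> unclaimed(List<Pic> all, List<Integer> runPicOffsets) {
        Map<Integer, Pic> lookup = index(all);
        List<Pic> rest = new ArrayList<Pic>(all);
        for (Integer offset : runPicOffsets) {
            Pic claimed = lookup.get(offset);
            if (claimed != null) {
                rest.remove(claimed); // removes by identity; a second claim is a no-op
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        List<Pic> all = Arrays.asList(new Pic("a", 0), new Pic("b", 100), new Pic("c", 250));
        // Only one run references a picture, the one starting at offset 100
        for (Pic p : unclaimed(all, Arrays.asList(100))) {
            System.out.println("unclaimed: " + p.name);
        }
    }
}
```

Unlike the class above, this sketch removes claimed entries instead of nulling them, which avoids the index bookkeeping at the cost of losing the original list positions.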

