Handle many Unicode characters with PDFBox

Question

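I am writing a Java function that takes a String as a parameter and produces a PDF as output using PDFBox.

Everything works fine as long as I use Latin characters. However, I don't know in advance what the input will be; it might be some English as well as Chinese or Japanese characters.

In the case of non-Latin characters, here is the error I get: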

Exception in thread "main" java.lang.IllegalArgumentException: U+3053 ('kohiragana') is not available in this font Helvetica encoding: WinAnsiEncoding
at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:426)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:324)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:509)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:471)
at com.mylib.pdf.PDFBuilder.generatePdfFromString(PDFBuilder.java:122)
at com.mylib.pdf.PDFBuilder.main(PDFBuilder.java:111)

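If I understand correctly, I have to use one specific font for Japanese, another one for Chinese, and so on, because the font I am using (Helvetica) doesn't handle all the required Unicode characters.

I could also use a single font that handles all these Unicode characters, such as Arial Unicode. However, this font is under a specific license, so I cannot use it, and I haven't found another one.

I found some projects that aim to overcome this issue, like the Google NOTO project. However, this project provides multiple font files, so I would have to choose the correct file to load at runtime, depending on my input.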

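So I am facing two options, one of which I don't know how to implement properly:

- Keep searching for a font that handles almost every Unicode character (where is this holy grail I am desperately seeking?!)
- Try to detect which language is used and select a font depending on it. Although I don't (yet) know how to do that, I don't consider it a clean implementation, because the mapping between input and font file would be hardcoded, meaning I would have to hardcode every possible mapping.

Are there other solutions?
Am I completely off track?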

Thanks in advance for your help and guidance!

Here is the code I use to generate the PDF:

public static void main(String args[]) throws IOException {
    String latinText = "This is latin text";
    String japaneseText = "これは日本語です";

    // This works fine
    generatePdfFromString(latinText);

    // This throws an IllegalArgumentException
    generatePdfFromString(japaneseText);
}

private static OutputStream generatePdfFromString(String content) throws IOException {
    PDPage page = new PDPage();

    try (PDDocument doc = new PDDocument();
         PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
        doc.addPage(page);
        contentStream.setFont(PDType1Font.HELVETICA, 12);

        // Or load a specific font from a file
        // contentStream.setFont(PDType0Font.load(doc, new File("/fontPath.ttf")), 12);

        contentStream.beginText();
        contentStream.showText(content);
        contentStream.endText();
        contentStream.close();
        OutputStream os = new ByteArrayOutputStream();
        doc.save(os);
        return os;
    }
}

Answer 1

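A better solution than waiting for one font or guessing the text language is to have a multitude of fonts and select the correct font on a glyph-by-glyph basis.

You already found the Google Noto Fonts, which are a good base collection of fonts for this task.

Unfortunately, Google publishes the Noto CJK fonts only as OpenType fonts (.otf), not as TrueType fonts (.ttf), a policy that is unlikely to change, cf. the Noto fonts issue 249 and others. On the other hand, PDFBox does not support OpenType fonts and isn't actively working on OpenType support either, cf. PDFBOX-2482.

Thus, one has to convert the OpenType font to TrueType somehow. I simply took the file shared by djmilch in his blog post FREE FONT NOTO SANS CJK IN TTF.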

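Font selection per character

So you essentially need a method that checks your text character by character and dissects it into chunks that can be drawn using the same font.

Unfortunately, I don't see a better way to ask a PDFBox PDFont whether it knows a glyph for a given character than to actually try to encode the character and treat an IllegalArgumentException as a "no".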
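The "try to encode and treat the exception as a no" idea can be isolated in a small helper. The following is only a sketch that runs on the plain JDK: GlyphEncoder is a hypothetical stand-in for PDFont, and ASCII_ONLY is a toy encoder invented for illustration, so none of these names belong to PDFBox.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a font: encode() throws if a glyph is missing. */
interface GlyphEncoder {
    byte[] encode(String text);
}

class EncodeProbe {
    /** Returns true if the encoder accepts the text, false if it rejects it. */
    static boolean canEncode(GlyphEncoder encoder, String text) {
        try {
            encoder.encode(text);
            return true;
        } catch (IllegalArgumentException e) {
            // The encoder has no glyph for at least one character.
            return false;
        }
    }

    /** Toy encoder (an assumption for this sketch) that only knows plain ASCII. */
    static final GlyphEncoder ASCII_ONLY = text -> {
        text.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                throw new IllegalArgumentException(
                        String.format("U+%04X is not available", cp));
            }
        });
        return text.getBytes(StandardCharsets.US_ASCII);
    };

    public static void main(String[] args) {
        System.out.println(canEncode(ASCII_ONLY, "This is latin text")); // true
        // "\u3053" is the hiragana character from the stack trace in the question.
        System.out.println(canEncode(ASCII_ONLY, "\u3053"));             // false
    }
}
```

With a real PDFont the probe would call font.encode(...) in the same way; the exception is then the only signal that the glyph is missing.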

Thus, I implemented that functionality using the following helper class TextWithFont and method fontify:

class TextWithFont {
    final String text;
    final PDFont font;

    TextWithFont(String text, PDFont font) {
        this.text = text;
        this.font = font;
    }

    public void show(PDPageContentStream canvas, float fontSize) throws IOException {
        canvas.setFont(font, fontSize);
        canvas.showText(text);
    }
}

(AddTextWithDynamicFonts inner class)

List<TextWithFont> fontify(List<PDFont> fonts, String text) throws IOException {
    List<TextWithFont> result = new ArrayList<>();
    if (text.length() > 0) {
        PDFont currentFont = null;
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int codeChars = Character.charCount(codePoint);
            String codePointString = text.substring(i, i + codeChars);
            boolean canEncode = false;
            for (PDFont font : fonts) {
                try {
                    font.encode(codePointString);
                    canEncode = true;
                    if (font != currentFont) {
                        if (currentFont != null) {
                            result.add(new TextWithFont(text.substring(start, i), currentFont));
                        }
                        currentFont = font;
                        start = i;
                    }
                    break;
                } catch (Exception ioe) {
                    // font cannot encode codepoint
                }
            }
            if (!canEncode) {
                throw new IOException("Cannot encode '" + codePointString + "'.");
            }
            i += codeChars;
        }
        result.add(new TextWithFont(text.substring(start, text.length()), currentFont));
    }
    return result;
}

(AddTextWithDynamicFonts method)
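The fontify loop advances by code point (codePointAt plus Character.charCount) rather than by char, so a character outside the Basic Multilingual Plane is handed to the font as one unit instead of two broken surrogate halves. A minimal JDK-only sketch of just that stepping logic follows; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSteps {
    /** Splits text into one string per Unicode code point, keeping surrogate pairs together. */
    static List<String> splitByCodePoint(String text) {
        List<String> pieces = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int codeChars = Character.charCount(codePoint); // 1 for BMP, 2 for supplementary
            pieces.add(text.substring(i, i + codeChars));
            i += codeChars;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // U+20BB7 is a CJK ideograph outside the BMP and needs two Java chars.
        String supplementary = new String(Character.toChars(0x20BB7));
        String text = "a" + supplementary + "b";
        System.out.println(text.length());                 // 4 chars
        System.out.println(splitByCodePoint(text).size()); // 3 code points
    }
}
```

Iterating with charAt instead would present each surrogate half to the font separately, and neither half can be encoded on its own.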

Example use

Using the method and class above like this

String latinText = "This is latin text";
String japaneseText = "これは日本語です";
String mixedText = "Tこhれiはs日 本i語sで すlatin text";

generatePdfFromStringImproved(latinText).writeTo(new FileOutputStream("Cccompany-Latin-Improved.pdf"));
generatePdfFromStringImproved(japaneseText).writeTo(new FileOutputStream("Cccompany-Japanese-Improved.pdf"));
generatePdfFromStringImproved(mixedText).writeTo(new FileOutputStream("Cccompany-Mixed-Improved.pdf"));

(AddTextWithDynamicFonts test testAddLikeCccompanyImproved)

ByteArrayOutputStream generatePdfFromStringImproved(String content) throws IOException {
    try (   PDDocument doc = new PDDocument();
            InputStream notoSansRegularResource = AddTextWithDynamicFonts.class.getResourceAsStream("NotoSans-Regular.ttf");
            InputStream notoSansCjkRegularResource = AddTextWithDynamicFonts.class.getResourceAsStream("NotoSansCJKtc-Regular.ttf")   ) {
        PDType0Font notoSansRegular = PDType0Font.load(doc, notoSansRegularResource);
        PDType0Font notoSansCjkRegular = PDType0Font.load(doc, notoSansCjkRegularResource);
        List<PDFont> fonts = Arrays.asList(notoSansRegular, notoSansCjkRegular);

        List<TextWithFont> fontifiedContent = fontify(fonts, content);

        PDPage page = new PDPage();
        doc.addPage(page);
        try (   PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
            contentStream.beginText();
            for (TextWithFont textWithFont : fontifiedContent) {
                textWithFont.show(contentStream, 12);
            }
            contentStream.endText();
        }
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        doc.save(os);
        return os;
    }
}

(AddTextWithDynamicFonts helper method)

I get correctly rendered output for each input:

for latinText = "This is latin text"
for japaneseText = "これは日本語です"
and for mixedText = "Tこhれiはs日本i語sですlatin text"

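Some asides

I retrieved the fonts as Java resources, but you can use any kind of InputStream for them.
The font selection mechanism above can quite easily be combined with the line breaking mechanism shown in this answer and the justification extension thereof in this answer.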

Answer 2

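Below is another implementation of splitting plain text into chunks of TextWithFont objects. The algorithm encodes character by character and always tries the main font first; only if that fails does it proceed to the next font in the list of fallback fonts.

Main class attributes: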

public class SplitByFontsProcessor {

  /** Text to be processed */
  private String text;

  /** List of fonts to be used for processing */
  private List<PDFont> fonts;

  /** Main font to be used for processing */
  private PDFont mainFont;

  /** List of fallback fonts to be used for processing. It does not contain the main font. */
  private List<PDFont> fallbackFonts;

........
}

Method within the same class:

private List<TextWithFont> splitUsingFallbackFonts() throws IOException {

    final List<TextWithFont> fontifiedText = new ArrayList<>();

    final StringBuilder strBuilder = new StringBuilder();
    boolean isHandledByMainFont = false;

    // Iterator over Unicode codepoints in Java string
    final PrimitiveIterator.OfInt iterator = text.codePoints().iterator();
    while (iterator.hasNext()) {
      int codePoint = iterator.nextInt();
      final String stringCodePoint = new String(Character.toChars(codePoint));

      // try to encode Unicode codepoint
      try {
        // Multi-byte encoding with 1 to 4 bytes.
        mainFont.encode(stringCodePoint); // fails here if can not be handled by the font
        strBuilder.append(stringCodePoint); // append if succeeded to encode
        isHandledByMainFont = true;
      } catch(IllegalArgumentException ex) {
        // IllegalArgumentException is thrown if character can not be handled by a given Font
        // Adding successfully handled characters so far
        if (StringUtils.isNotEmpty(strBuilder.toString())) {
          fontifiedText.add(new TextWithFont(strBuilder.toString(), mainFont));
          strBuilder.setLength(0);// clear StringBuilder
        }

        handleByFallbackFonts(fontifiedText, stringCodePoint);
        isHandledByMainFont = false;
      } // end main font try-catch
    }

    // If this is the last successful run that was handled by main font, then add result
    if (isHandledByMainFont) {
      fontifiedText.add(new TextWithFont(strBuilder.toString(), mainFont));
    }

    return mergeAdjacents(fontifiedText);
  }

Method handleByFallbackFonts():

  private void handleByFallbackFonts(List<TextWithFont> fontifiedText, String stringCodePoint)
      throws IOException {

    final StringBuilder strBuilder = new StringBuilder();
    boolean isHandledByFallbackFont = false;
    // Retry with fallback fonts
    final Iterator<PDFont> fallbackFontsIterator = fallbackFonts.iterator();

    while(fallbackFontsIterator.hasNext()) {
      try {
        final PDFont fallbackFont = fallbackFontsIterator.next();
        fallbackFont.encode(stringCodePoint); // fails here if can not be handled by the font
        isHandledByFallbackFont = true;
        strBuilder.append(stringCodePoint);
        fontifiedText.add(new TextWithFont(strBuilder.toString(), fallbackFont));
        break; // if successfully handled - break the loop
      } catch(IllegalArgumentException exception) {
        // do nothing, proceed to the next font
      }
    } // end while 

    // If character was not handled and this is the last font - throw an exception
    if (!isHandledByFallbackFont) {
      final String fontNames = fonts.stream()
          .map(PDFont::getName)
          .collect(Collectors.joining(", "));

      int codePoint = stringCodePoint.codePointAt(0);

      throw new TextProcessingException(
          String.format("Unicode code point [%s] can not be handled by configured fonts: [%s]",
              codePoint, fontNames));
    }
  }

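The method splitUsingFallbackFonts() returns a list of TextWithFont objects in which adjacent objects with the same font have not necessarily been merged into one object. This happens because the algorithm always retries rendering each character with the main font first, and if that fails, it creates a new object with a font that is able to render the character. So we need to call a utility method, mergeAdjacents(), that merges them together.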

 private static List<TextWithFont> mergeAdjacents(final List<TextWithFont> fontifiedText) {

    final Deque<TextWithFont> result = new LinkedList<>();

    for (TextWithFont elem : fontifiedText) {
      final TextWithFont resElem = result.peekLast();
      if (resElem == null || !resElem.getFont().equals(elem.getFont())) {
        result.addLast(elem);
      } else {
        result.addLast(merge(result.pollLast(), elem));
      }
    }

    return new ArrayList<>(result);
  }
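mergeAdjacents() relies on a merge() helper and a getFont() accessor that are not shown here. The sketch below fills that gap under stated assumptions: Run is a hypothetical stand-in for TextWithFont that carries a font name instead of a PDFont, and merge() is assumed to concatenate the two texts and keep the shared font. It runs on the plain JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical stand-in for TextWithFont; a font name replaces the PDFont reference. */
final class Run {
    final String text;
    final String font;

    Run(String text, String font) {
        this.text = text;
        this.font = font;
    }
}

class MergeSketch {
    /** Assumed behaviour of merge(): concatenate the texts, keep the shared font. */
    static Run merge(Run first, Run second) {
        return new Run(first.text + second.text, first.font);
    }

    /** Collapses neighbouring runs that use the same font into a single run. */
    static List<Run> mergeAdjacents(List<Run> runs) {
        Deque<Run> result = new LinkedList<>();
        for (Run run : runs) {
            Run last = result.peekLast();
            if (last == null || !last.font.equals(run.font)) {
                result.addLast(run);                           // font changed: start a new run
            } else {
                result.addLast(merge(result.pollLast(), run)); // same font: extend the last run
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<Run> merged = mergeAdjacents(Arrays.asList(
                new Run("T", "main"), new Run("\u3053", "cjk"),
                new Run("\u308C", "cjk"), new Run("h", "main")));
        for (Run run : merged) {
            System.out.println(run.font + ": " + run.text);
        }
    }
}
```

In the example the two neighbouring "cjk" runs collapse into one, so four input runs become three.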
