Why is a scanned PDF page returned rotated 90 degrees clockwise when extracting it as an image?

Question

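I use iText 7 to convert PDF pages (from scanned documents) to images so that I can process them with OCR. For some PDF files this works very well, but for others the "extracted" image comes back rotated by 90 degrees!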

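Consider a document that works fine: I open Word, enter some text and pictures, and then convert the file to PDF. When I use iText 7 on such a file, I can get the text and the images without any problem!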

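Consider a document that causes the problem: I scan a letter and have the resulting PDF file X sent to my email. X has only a single image layer. If I parse X with iText 7 and create a new image from the byte array I get (using an EventListener for the event type RENDER_IMAGE), the image is created with a 90 degree rotation???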

So I use the same C# code for both documents, but the output differs...

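I took the output image from X (the rotated one) and converted it into a PDF file; let's call it Y. When I then created an image from Y again, the new image was not rotated compared with Y! I only ran this test to see whether the image always gets rotated...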

// Implementation of IEventListener:

    public void EventOccurred(IEventData data, EventType type)
    {
        switch (type)
        {
            case EventType.RENDER_IMAGE:
                ImageRenderInfo renderInfo = (ImageRenderInfo)data;
                PdfImageXObject image = renderInfo.GetImage();
                if (image == null)
                {
                    return;
                }
                // Decoded bytes of the image resource, as stored in the PDF.
                byte[] imageBytes = image.GetImageBytes(true);
                string extension = image.IdentifyImageFileExtension();
                string filename = String.Format(@"{0}\{1}.{2}", path, Guid.NewGuid().ToString(), extension);
                images.Add(new ImageStreamObject(imageBytes, filename));
                break;
        }
    }

// Class ImageStreamObject

public class ImageStreamObject
{
    byte[] image;
    string path;

    /// <summary>
    /// Creates a data object for storing an image as a byte array and its filepath.
    /// </summary>
    /// <param name="byteArray"></param>
    /// <param name="filePath"></param>
    public ImageStreamObject(byte[] byteArray, string filePath)
    {
        image = byteArray;
        path = filePath;
    }

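    public String GetImagePath()
    {
        return path;
    }

    // Accessor used by CreateImagesFromPdfPage.
    public byte[] GetImageAsByteArray()
    {
        return image;
    }
}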

// Constructor of the object that performs the image "extraction":

    public PdfImageExtractor(string filePath, string imageOutputPath)
    {
        pdf = new PdfDocument(new PdfReader(filePath));
        listener = new ImageRenderListener(imageOutputPath);
        parser = new PdfCanvasProcessor(listener);
        imageBuffer = new List<string>();
    }

// Method of PdfImageExtractor that creates the image files:

    public List<string> CreateImagesFromPdfPage(int page)
    {
        // Discard images collected from a previously processed page.
        listener.GetImageStreamObjects().Clear();
        parser.ProcessPageContent(pdf.GetPage(page));
        imageStreamObjects = listener.GetImageStreamObjects();
        List<string> pathes = GetImagePathes();
        foreach (ImageStreamObject imageStreamObject in imageStreamObjects)
        {
            string tempPath = imageStreamObject.GetImagePath();
            byte[] tempImage = imageStreamObject.GetImageAsByteArray();
            // Creates or overwrites the file and closes it afterwards.
            File.WriteAllBytes(tempPath, tempImage);
        }
        return pathes;
    }

Answer 1

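The bitmap image you extract is exactly as it is stored in the PDF as a resource (at least as far as orientation is concerned). But whenever a bitmap resource is drawn, it is subject to the current transformation matrix at the time of drawing, and the current transformation can considerably rotate, skew, translate, and stretch the bitmap.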

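You can retrieve the value of the current transformation matrix at the time a bitmap is drawn from the ImageRenderInfo renderInfo using

Matrix ctm = renderInfo.GetImageCtm();

and analyze it. Additionally, you have to take the page rotation into account, which you can retrieve for the page number page using

int rotation = pdf.GetPage(page).GetRotation();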
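
As a rough illustration of what "analyzing" the matrix could look like, here is a minimal sketch that does not depend on iText. It assumes the usual PDF matrix layout [a b c d e f], where a pure counterclockwise rotation by θ has a = cos θ and b = sin θ, and it assumes the page rotation is the clockwise /Rotate value (a multiple of 90). The class and method names are hypothetical, and mirrored or skewed images are not handled. With iText, a and b would presumably be read from the matrix with something like ctm.Get(Matrix.I11) and ctm.Get(Matrix.I12); treat that as an assumption to check against the iText API documentation.

```csharp
using System;

// Hypothetical helper, not part of iText.
public static class CtmRotation
{
    // Counterclockwise rotation in degrees, normalized to [0, 360), implied by
    // the first row (a, b) of a PDF transformation matrix [a b c d e f].
    public static double RotationDegrees(double a, double b)
    {
        double degrees = Math.Atan2(b, a) * 180.0 / Math.PI;
        return (degrees % 360.0 + 360.0) % 360.0;
    }

    // Clockwise rotation, in whole degrees, at which the image appears on the
    // displayed page: the CTM rotation (converted to clockwise) plus the
    // page rotation. Assumes no mirroring or skew in the matrix.
    public static int EffectiveClockwiseRotation(double a, double b, int pageRotation)
    {
        int ctmClockwise = (360 - (int)Math.Round(RotationDegrees(a, b))) % 360;
        return ((ctmClockwise + pageRotation) % 360 + 360) % 360;
    }
}
```

If the result is non-zero, the extracted bitmap would need to be rotated by that amount (for example before passing it to OCR) to match what a viewer shows for the page.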

c#

itext7