iTextSharp PdfTextExtractor GetTextFromPage throwing NullReferenceException

Question

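I am using iTextSharp to read PDF documents, but recently I have been getting

{"Object reference not set to an instance of an object."}

or, in other words, a NullReferenceException when getting the text of a page from a PdfReader. It worked before, but from one day to the next it stopped working. I have not changed my code.

Here is my code: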

    // iTextSharp page numbers are 1-based.
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(reader, i, its);
        if (currentText.Contains("ADVANCES"))
        {
            return i;
        }
    }

    // No page contains the marker text.
    return 0;

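The code above throws the NullReferenceException, even though reader is not null, and i obviously cannot be null since it is an int.

I am instantiating the PdfReader from an input stream:

    PdfReader reader = new PdfReader(_stream);

Here is the stack trace: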

  at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)

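To keep things simple, I tried creating a small console application that just reads all the text from the PDF file and displays it. The code is below. The result is the same as above: it throws a NullReferenceException.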

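    using System;
    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(ExtractTextFromPdf(@"stockQuotes_03232015.pdf"));
        }

        // Concatenates the extracted text of every page of the PDF at the given path.
        public static string ExtractTextFromPdf(string path)
        {
            using (PdfReader reader = new PdfReader(path))
            {
                StringBuilder text = new StringBuilder();

                // iTextSharp page numbers are 1-based.
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
                }

                return text.ToString();
            }
        }
    }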

Does anyone know what might be going on here, or how I can work around it?

Answer 1

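Summarizing what was found out in the comments to the question...

In short

The PDF the OP originally worked with is invalid: it lacks required objects that the parser is interested in.

Now that he has finally got hold of a valid version, he can parse it successfully.

In detail

Depending on when and how they were requested, the web site the PDFs in question were requested from returned different versions of the same document, sometimes complete, sometimes incomplete.

The test file was stockQuotes_03232015.pdf, i.e. the PDF containing the data generated on the day of testing.

The complete file can already be recognized by its size: in my downloads it is 250933 bytes long, while my incomplete file is 81062 bytes long.

Inspecting the files, it looks like the incomplete file was derived from the complete one by some tool that removed duplicate image streams but forgot to redirect the references to the removed streams to the stream object that was kept.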
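One way to check a file for this kind of damage is to ask iTextSharp whether the XObject entries in each page's resources still resolve to a stream. The stack trace in the question ends in DisplayXObject, which is consistent with a reference to an image stream that no longer exists. The following is only a minimal sketch, not the procedure used for the analysis above. The names XObjectCheck and FindUnresolvedXObjects are made up for illustration, the code assumes iTextSharp 5.x, and it only inspects page-level resources (XObjects nested inside form XObjects are not followed).

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
static class XObjectCheck
{
    // Returns one entry per page-level XObject whose reference does not
    // resolve to a stream object in the file.
    public static List<string> FindUnresolvedXObjects(PdfReader reader)
    {
        var problems = new List<string>();

        // iTextSharp page numbers are 1-based.
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            PdfDictionary pageDict = reader.GetPageN(page);
            PdfDictionary resources = pageDict.GetAsDict(PdfName.RESOURCES);
            PdfDictionary xobjects =
                resources == null ? null : resources.GetAsDict(PdfName.XOBJECT);
            if (xobjects == null)
            {
                continue; // this page declares no XObjects
            }

            foreach (PdfName name in xobjects.Keys)
            {
                // GetDirectObject follows the indirect reference; the result
                // is null when the referenced object is missing.
                PdfObject target = xobjects.GetDirectObject(name);
                if (target == null || !target.IsStream())
                {
                    problems.Add("page " + page + ": " + name);
                }
            }
        }

        return problems;
    }

    static void Main(string[] args)
    {
        // args[0] is the path of the PDF to inspect.
        using (PdfReader reader = new PdfReader(args[0]))
        {
            foreach (string problem in FindUnresolvedXObjects(reader))
            {
                Console.WriteLine("Unresolved XObject on " + problem);
            }
        }
    }
}
```

If the list is non-empty for a downloaded file, text extraction on the listed pages is likely to fail, and the practical remedy remains the one described above: fetch the file again until a complete version is returned.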

Answer 2

Please use the code below to read the text from a PDF. It displays the text from the PDF in a RichTextBox, i.e. richTextBox1.

Reference (YouTube): https://www.youtube.com/watch?v=22C9N4WP4-s

        using (OpenFileDialog ofd = new OpenFileDialog() { Filter = "PDF files|*.pdf", ValidateNames = true, Multiselect = false })
        {
            if(ofd.ShowDialog() == DialogResult.OK)
            {
                try
                {
                    iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(ofd.FileName);
                    StringBuilder sb = new StringBuilder();
                    // Page numbers are 1-based; <= makes sure the last page is included.
                    for (int i = 1; i <= reader.NumberOfPages; i++)
                    {
                        sb.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader,i));
                    }
                    richTextBox1.Text = sb.ToString();
                    reader.Close();

                }
                catch (Exception ex)
                {
                    MessageBox.Show(ex.Message, "Message", MessageBoxButtons.OK, MessageBoxIcon.Error);
                }
            }
        }
