Extract value from PDF file to variable

Question

I am trying to get the "Invoice number" from a PDF file, in this case INV-3337, and want to store it in a variable for later use in my code.

At the moment I am working through examples and using this PDF for testing: https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

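With my current code I can parse the entire content into .txt format. Could someone guide me on how to get only the value I need and store it in a variable? Can this be done directly with iTextSharp? Or do I need to parse everything into a .txt file first, then parse the .txt file, store the value in a variable, delete the .txt file, and continue?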

Note: in the actual setup there will be many PDF files to parse.

This is my current code:

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;


namespace PDF_parser
{
    class Program
    {
        static void Main(string[] args)
        {

            string filePath = @"C:\temp\parser\Invoice_Template.pdf";
            string outPath = @"C:\temp\parser\Invoice_Template.txt";
            int pagesToScan = 2;

            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);

                for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);

                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        //Creating and appending to a text file
                        using (StreamWriter file = new StreamWriter(outPath, true))
                        {
                            file.WriteLine(line);
                        }
                    }
                }

                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}

Edit:

Am I getting this right?

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;


namespace PDF_parser
{
    class Program
    {
        static void Main(string[] args)
        {

            string filePath = @"C:\temp\parser\Invoice_Template.pdf";
            string outPath = @"C:\temp\parser\Invoice_Template.txt";
            int pagesToScan = 2;

            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);

                for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);

                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        //Creating and appending to a text file
                        using (StreamWriter file = new StreamWriter(outPath, true))
                        {
                            // file.WriteLine(line);

                            int indexOccurrance = line.LastIndexOf("Invoice Number");
                            if (indexOccurrance > 0)
                            {
                                var invoiceNumber = line.Substring(indexOccurrance, (line.Length - indexOccurrance));
                            }
                        }
                    }
                }

                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}

Answer 1

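One option is to search each line of text for "Invoice Number" using LastIndexOf. If the label is found, use Substring to take the rest of the line after it, which will be the invoice number. Note that LastIndexOf returns -1 when there is no match and 0 when the match sits at the very start of the line, so the check has to be >= 0 rather than > 0.

Something like:

const string label = "Invoice Number";
string invoiceNumber = null; // declared outside the if so it stays usable afterwards
int indexOccurrance = line.LastIndexOf(label, StringComparison.OrdinalIgnoreCase);
if (indexOccurrance >= 0)
{
    // Skip past the label itself and trim the surrounding whitespace
    invoiceNumber = line.Substring(indexOccurrance + label.Length).Trim();
}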
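
Since PdfTextExtractor.GetTextFromPage already returns the page text as a string, the intermediate .txt file should not be necessary: the value can be pulled straight out of that string. Below is a minimal sketch of one way to do that with a regular expression. The class name InvoiceParser, the method TryExtractInvoiceNumber, and the pattern are all hypothetical. The pattern assumes the extracted text contains the label "Invoice Number", optionally followed by a colon, and then a value made of letters, digits, hyphens, or slashes (such as INV-3337). Real invoices may be laid out differently, so treat the pattern as a starting point.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class InvoiceParser
{
    // Assumed layout: "Invoice Number", an optional colon, then the value.
    // The value must start with a letter or digit and may contain '-' or '/'.
    private static readonly Regex InvoiceNumberPattern = new Regex(
        @"Invoice\s+Number\s*:?\s*([A-Za-z0-9][A-Za-z0-9\-/]*)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns true and sets invoiceNumber when the label and a value are found;
    // otherwise returns false and sets invoiceNumber to an empty string.
    public static bool TryExtractInvoiceNumber(string pageText, out string invoiceNumber)
    {
        invoiceNumber = string.Empty;
        if (string.IsNullOrEmpty(pageText))
        {
            return false;
        }

        Match match = InvoiceNumberPattern.Match(pageText);
        if (!match.Success)
        {
            return false;
        }

        // Group 1 holds only the value, without the label.
        invoiceNumber = match.Groups[1].Value;
        return true;
    }
}
```

In the loop from the question, this could be called with strText right after GetTextFromPage, breaking out of the page loop once it returns true. With many PDF files, the same call can be repeated per file and the results collected in a list or dictionary keyed by file path, with no temporary files involved.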

c#

pdf

parsing

itext

pdf-parsing