Extract value from PDF file to variable
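I am trying to get the "Invoice number" from a PDF file, in this case INV-3337, and want to store it in a variable for later use in my code.
I am currently working through examples and use this PDF for testing:
https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
With my current code I can dump the whole content to a .txt file. Can someone guide me on how to get only the value I need and store it in a variable? Can this be done directly with iTextSharp? Or do I have to write everything to a .txt file first, parse that file, store the value in a variable, delete the .txt file and then carry on?
Note: in the real setup there will be many PDF files to parse.
This is my current code: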
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;
namespace PDF_parser
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"C:\temp\parser\Invoice_Template.pdf";
            string outPath = @"C:\temp\parser\Invoice_Template.txt";
            int pagesToScan = 2;
            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);
                for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        //Creating and appending to a text file
                        using (StreamWriter file = new StreamWriter(outPath, true))
                        {
                            file.WriteLine(line);
                        }
                    }
                }
                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}
Edit:
Did I understand this correctly?
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;
namespace PDF_parser
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"C:\temp\parser\Invoice_Template.pdf";
            string outPath = @"C:\temp\parser\Invoice_Template.txt";
            int pagesToScan = 2;
            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);
                for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        //Creating and appending to a text file
                        using (StreamWriter file = new StreamWriter(outPath, true))
                        {
                            // file.WriteLine(line);
                            int indexOccurrance = line.LastIndexOf("Invoice Number");
                            if(indexOccurrance > 0)
                            {
                                var invoiceNumber = line.Substring(indexOccurrance, (line.Length - indexOccurrance) );
                            }
                        }
                    }
                }
                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}
One option is to use LastIndexOf to search each line of text for "Invoice Number".
If it is found, use Substring to get the rest of the line (which will be the invoice number).
Something like:
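const string label = "Invoice Number";
int indexOccurrance = line.LastIndexOf(label);
if (indexOccurrance >= 0) // -1 means "not found"; 0 is a valid match at the start of the line
{
    // Start after the label so that only the value (e.g. INV-3337) remains
    var invoiceNumber = line.Substring(indexOccurrance + label.Length).Trim();
}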
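To answer the "do I need the .txt file" part: the text returned by PdfTextExtractor.GetTextFromPage is already a string in memory, so it can be searched directly and no temporary file should be needed. Below is a minimal sketch of that idea using a regular expression instead of LastIndexOf. The names InvoiceParser and TryGetInvoiceNumber are hypothetical helpers made up for this example, and the pattern assumes the label and the value end up on the same extracted line (something like "Invoice Number INV-3337"), which may not hold for every PDF layout.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class InvoiceParser
{
    // Assumption: label and value are on the same line, optionally separated by a colon.
    // [ \t]* (rather than \s*) keeps the match from running over a line break.
    private static readonly Regex InvoiceNumberPattern =
        new Regex(@"Invoice Number[ \t]*:?[ \t]*(\S+)", RegexOptions.IgnoreCase);

    // Returns true and sets invoiceNumber when the label is found in pageText.
    public static bool TryGetInvoiceNumber(string pageText, out string invoiceNumber)
    {
        Match match = InvoiceNumberPattern.Match(pageText ?? string.Empty);
        invoiceNumber = match.Success ? match.Groups[1].Value : string.Empty;
        return match.Success;
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for the string returned by PdfTextExtractor.GetTextFromPage(reader, page, its).
        // The surrounding lines are made up for illustration.
        string pageText = "Some header\nInvoice Number INV-3337\nSome footer";

        if (InvoiceParser.TryGetInvoiceNumber(pageText, out string invoiceNumber))
        {
            Console.WriteLine(invoiceNumber); // prints "INV-3337"
        }
    }
}
```

In the question's loop you would pass strText to TryGetInvoiceNumber for each page and stop at the first match. For many PDF files, the same call could be wrapped in a loop over the file paths, with the StreamWriter part dropped entirely.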