Using PDFsharp and MigraDoc to write to and then read from a PDF
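I am trying to write verification code for our PDF generation routines, but I am having trouble getting PDFsharp to extract text from files created with MigraDoc. The ExtractText code works for other PDFs, but not for the PDFs I generate with MigraDoc (see the code below).
Any hints as to what I am doing wrong?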
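// Namespaces the snippet relies on
using System.IO;
using System.Linq;
using MigraDoc.Rendering;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

// Create the document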
var doc = new MigraDoc.DocumentObjectModel.Document();
doc.Info.Title = "VerifyReadWrite";
var section = doc.AddSection();
section.AddParagraph("ABCDEF abcdef");
//Render the PDF
var renderer = new PdfDocumentRenderer(true);
var pdf = new PdfDocument();
renderer.PdfDocument = pdf;
renderer.Document = doc;
renderer.RenderDocument();
var msOut = new MemoryStream();
pdf.Save(msOut, true);
var pdfBytes = msOut.ToArray();
//Read the PDF into PdfSharp
var ms = new MemoryStream(pdfBytes);
var pdfRead = PdfSharp.Pdf.IO.PdfReader.Open(ms, PdfDocumentOpenMode.ReadOnly);
var segments = pdfRead.Pages[0].ExtractText().ToList();
The results are:
segments[0] = "\0$\0%\0&\0'\0(\0)"
segments[1] = "\0D\0E\0F\0G\0H\0I"
I expected to see:
segments[0] = "ABCDEF"
segments[1] = "abcdef"
I am using the ExtractText code from here:
C# Extract text from PDF using PdfSharp
It works great for all files except PDFs generated with MigraDoc.
public static IEnumerable<string> ExtractText(this PdfPage page)
{
    var content = ContentReader.ReadContent(page);
    var text = content.ExtractText();
    return text.Select(x => x.Trim());
}

public static IEnumerable<string> ExtractText(this CObject cObject)
{
    if (cObject is COperator)
    {
        var cOperator = (COperator) cObject;
        // Only the text-showing operators Tj and TJ carry string operands
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
                foreach (var txt in ExtractText(cOperand))
                    yield return txt;
        }
    }
    else
    {
        var sequence = cObject as CSequence;
        if (sequence != null)
        {
            // Recurse into nested sequences (e.g. the array operand of TJ)
            foreach (var element in sequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString)
        {
            // Returns the raw string from the content stream, without font decoding
            var cString = (CString) cObject;
            yield return cString.Value;
        }
    }
}
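It seems the code used to extract the text does not support all cases.
Try new PdfDocumentRenderer(false)
(instead of 'true'). AFAIK this will lead to a different encoding, and the text extraction might then work.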
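To see what "a different encoding" means for the output in the question, it can help to look at the returned strings as numbers. The sketch below uses only the .NET base library and splits such a string into 16-bit big-endian codes. The class and method names (RawStringInspector, ToCodes) are hypothetical, and reading the codes as font glyph indices is an assumption based on the observed output, not something PDFsharp reports.

```csharp
using System;
using System.Collections.Generic;

public static class RawStringInspector
{
    // Hypothetical helper: treats each pair of chars in a raw content-stream
    // string as one big-endian 16-bit code (high byte first, low byte second).
    // A trailing unpaired char, if any, is ignored.
    public static List<int> ToCodes(string raw)
    {
        if (raw == null) throw new ArgumentNullException(nameof(raw));

        var codes = new List<int>();
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            codes.Add((raw[i] << 8) | raw[i + 1]);
        }
        return codes;
    }
}
```

For the first segment in the question, ToCodes("\0$\0%\0&\0'\0(\0)") gives the codes 36 to 41, and the second segment gives 68 to 73. Each code is 29 less than the ASCII value of the expected letter ('A' is 65, 'a' is 97), which looks like two-byte indices into the font rather than character codes. That would explain why an extractor that returns CString.Value unchanged cannot produce readable text here; the offset of 29 is specific to this output and should not be relied on for other fonts.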