Replace a string in a PDF document (iTextSharp or PdfSharp)
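We use an unmanaged DLL that can replace text in a PDF document (http://www.debenu.com/docs/pdf_library_reference/ReplaceTag.php).
We are trying to move to a managed solution (iTextSharp or PdfSharp).
I know this question has been asked before, and that the answers are "you should not do it" or "it is not easily supported by PDF".
However, there is a solution that works for us, and we just need to convert it to C#.
Any ideas on how I should approach this?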
Judging by your library reference link, you use the Debenu PDFLibrary function ReplaceTag. According to this Debenu knowledge base article:
the ReplaceTag function simply replaces text in the page’s content stream, so for most documents it wouldn’t have any effect. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed. Essentially it’s the same as doing:
DPL.CombineContentStreams();
string content = DPL.GetContentStreamToString();
DPL.SetPageContentFromString(content.Replace("Moby", "Mary"));
Any general-purpose PDF library should be able to do this; iText(Sharp) definitely can:
using System.IO;
using iTextSharp.text.pdf;

void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
    using (PdfReader reader = new PdfReader(OrigFile))
    {
        // Read the raw content stream of page 1 and treat it as text.
        byte[] contentBytes = reader.GetPageContent(1);
        string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
        // Plain string replacement, just like the Debenu snippet above.
        contentString = contentString.Replace(origText, replaceText);
        reader.SetPageContent(1, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
        // Closing the stamper writes the modified document to ResultFile.
        new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
    }
}
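WARNING: Just like the Debenu function, for most documents this code will have no effect at all, or may even be destructive. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed.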
By the way, the Debenu knowledge base article continues:
If you created a PDF using Debenu Quick PDF Library and a standard font then the ReplaceTag function should work – however, for PDFs created with tools that do subsetted fonts or even kerning (where words will be split up) then the search text probably won’t be in the content in a simple format.
So in short, the ReplaceTag function will only work in some limited scenarios and isn’t a function that you can rely on for searching and replacing text.
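To make the kerning point concrete, here is a minimal sketch that runs the same plain string replacement on two hand-written content stream fragments. Both fragments, and the names ContentStreamDemo and NaiveReplace, are made up for illustration and are not taken from any real document or library.

```csharp
using System;

public static class ContentStreamDemo
{
    // Hypothetical fragment: the whole phrase is a single string operand of Tj.
    public const string Simple = "BT /F1 12 Tf 72 720 Td (Call me Moby) Tj ET";

    // Hypothetical fragment: a kerning adjustment inside a TJ array splits the word.
    public const string Kerned = "BT /F1 12 Tf 72 720 Td [(Call me M) -15 (oby)] TJ ET";

    // The same idea as content.Replace("Moby", "Mary") in the snippets above.
    public static string NaiveReplace(string content, string search, string replacement)
    {
        return content.Replace(search, replacement);
    }
}
```

Calling ContentStreamDemo.NaiveReplace(ContentStreamDemo.Simple, "Moby", "Mary") changes the text, while the same call on ContentStreamDemo.Kerned returns the fragment unchanged, because the literal sequence "Moby" never appears in it. With subsetted fonts the string operands may not even contain readable character codes, so the search fails in the same way.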
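Thus, if during your move to a managed solution you also change the way the source documents are created, there is a good chance that neither the Debenu PDFLibrary function ReplaceTag nor the code above will be able to change the content as desired.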
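For PdfSharp users, here is a somewhat usable function. I copied it from my project; it relies on a utility method that other methods also use, which is why the return value is not used here.
It ignores the whitespace created by kerning, so depending on the source material it may mess up the result (all characters end up in the same space).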
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;

public static void ReplaceTextInPdfPage(PdfPage contentPage, string source, string target)
{
    ModifyPdfContentStreams(contentPage, stream =>
    {
        if (!stream.TryUnfilter())
            return false;
        // Allow optional whitespace between characters; escape regex metacharacters
        // so the search text is matched literally.
        var search = string.Join(@"\s*", source.Select(c => Regex.Escape(c.ToString())));
        // Note: Encoding.Default is platform dependent and may not round-trip every byte.
        var stringStream = Encoding.Default.GetString(stream.Value, 0, stream.Length);
        if (!Regex.IsMatch(stringStream, search))
            return false;
        stringStream = Regex.Replace(stringStream, search, target);
        stream.Value = Encoding.Default.GetBytes(stringStream);
        stream.Zip();
        return false; // false = keep going, so every stream is processed
    });
}

// Calls Modification on each content stream of the page and on each XObject
// stream in its resources; stops as soon as Modification returns true.
public static void ModifyPdfContentStreams(PdfPage contentPage, Func<PdfDictionary.PdfStream, bool> Modification)
{
    for (var i = 0; i < contentPage.Contents.Elements.Count; i++)
        if (Modification(contentPage.Contents.Elements.GetDictionary(i).Stream))
            return;

    var resources = contentPage.Elements?.GetDictionary("/Resources");
    var xObjects = resources?.Elements.GetDictionary("/XObject");
    if (xObjects == null)
        return;

    foreach (var item in xObjects.Elements.Values.OfType<PdfReference>())
    {
        var stream = (item.Value as PdfDictionary)?.Stream;
        if (stream != null && Modification(stream))
            return;
    }
}
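The whitespace-tolerant search used by ReplaceTextInPdfPage can be tried on plain strings without PdfSharp. The sketch below is only an illustration under my own assumptions: SearchPatternDemo, BuildPattern and ReplaceLoose are hypothetical names, and escaping `$` in the replacement text is a precaution of this sketch that the function above does not take.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SearchPatternDemo
{
    // Builds a pattern that matches the characters of source with optional
    // whitespace between them; each character is escaped to match literally.
    public static string BuildPattern(string source)
    {
        return string.Join(@"\s*", source.Select(c => Regex.Escape(c.ToString())));
    }

    // Replaces every loose match of source in content with target.
    public static string ReplaceLoose(string content, string source, string target)
    {
        // "$$" is a literal '$' in a .NET replacement pattern, so the target
        // is never interpreted as a group substitution.
        return Regex.Replace(content, BuildPattern(source), target.Replace("$", "$$"));
    }
}
```

With this, SearchPatternDemo.ReplaceLoose("(M o  by) Tj", "Moby", "Mary") rewrites the operand, but a kerned array such as "[(M) -15 (oby)] TJ" is left untouched, because the characters between the string pieces are not whitespace. That is the same limitation described in the Debenu article.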