C# iTextSharp - Code overwriting instead of appending pages
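I've read a lot of posts that helped me get this far, and I'm new to programming. My goal is to take the files in the directory "sourceDir" and look for a regex match. When a match is found, I want to create a new file named after the match. If the code finds another file with the same match (so the target file already exists), it should add a new page to that document.
Right now the code runs, but instead of adding a new page it overwrites the first page of the document. Note: every document in the directory has exactly one page!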
string sourceDir = @"C:\Users\bob\Desktop\results\";
string destDir = @"C:\Users\bob\Desktop\results\final\";
string[] files = Directory.GetFiles(sourceDir);
foreach (string file in files)
{
    using (var pdfReader = new PdfReader(file.ToString()))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            var text = new StringBuilder();
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            var currentText =
                PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
            Regex reg = new Regex(@"ABCDEFG");
            MatchCollection matches = reg.Matches(currentText);
            foreach (Match m in matches)
            {
                string newFile = destDir + m.ToString() + ".pdf";
                if (!File.Exists(newFile))
                {
                    using (PdfReader reader = new PdfReader(File.ReadAllBytes(file)))
                    {
                        using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                        {
                            using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.Create)))
                            {
                                var importedPage = copy.GetImportedPage(reader, page);
                                doc.Open();
                                copy.AddPage(importedPage);
                                doc.Close();
                            }
                        }
                    }
                }
                else
                {
                    using (PdfReader reader = new PdfReader(File.ReadAllBytes(newFile)))
                    {
                        using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                        {
                            using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.OpenOrCreate)))
                            {
                                var importedPage = copy.GetImportedPage(reader, page);
                                doc.Open();
                                copy.AddPage(importedPage);
                                doc.Close();
                            }
                        }
                    }
                }
            }
        }
    }
}
I'll write this in pseudocode.
You are doing this:
// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        // create single-page PDF
        new Document();
        new PdfCopy();
        document.Open();
        copy.addPage(singlePage);
        document.Close();
    }
}
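This means you are creating a single-page PDF every time the condition is met. Incidentally, you are overwriting existing files many times over.
What you should do is something like this: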
// Create a document with as many pages as times a condition is met
new Document();
new PdfCopy();
document.Open();
// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        copy.addPage(singlePage);
    }
}
document.Close();
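Now you are possibly adding more than one page to the new document you created with PdfCopy. Beware: an exception may be thrown if the condition is never met, because the document would then be closed without a single page in it.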
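One way to arrive at that shape is to decide first which source files belong to which output document, and only then write each output once. What follows is a minimal sketch of the grouping step alone, using nothing but the base class library. The names `PageGrouping` and `GroupByMatch` and the sample data are made up for illustration, and the extracted text is passed in as plain strings so that reading the PDFs and writing with PdfCopy stay out of the picture.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageGrouping
{
    // Maps every distinct regex match to the source files whose text contains it.
    // textByFile is a hypothetical input: file path -> text already extracted from that file.
    public static SortedDictionary<string, List<string>> GroupByMatch(
        IDictionary<string, string> textByFile, Regex pattern)
    {
        var groups = new SortedDictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (var entry in textByFile)
        {
            foreach (Match m in pattern.Matches(entry.Value))
            {
                List<string> files;
                if (!groups.TryGetValue(m.Value, out files))
                {
                    files = new List<string>();
                    groups[m.Value] = files;
                }
                // A file joins a group once, even if the term occurs several times in it.
                if (!files.Contains(entry.Key))
                {
                    files.Add(entry.Key);
                }
            }
        }
        return groups;
    }

    public static void Main()
    {
        // Sorted input so that the order of files inside each group is predictable.
        var textByFile = new SortedDictionary<string, string>
        {
            { "1.pdf", "This is page 1 ABC" },
            { "2.pdf", "This is page 2 DEF" },
            { "3.pdf", "This is page 3 ABCDEF" },
        };
        var groups = GroupByMatch(textByFile, new Regex("ABC|DEF"));
        foreach (var g in groups)
        {
            // Each line stands for one output document:
            // open it once, add every listed page, close it once.
            Console.WriteLine(g.Key + ".pdf <- " + string.Join(", ", g.Value));
        }
    }
}
```

With the groups in hand, the outer loop runs over output documents rather than over source files, which is exactly the "open once, add many pages, close once" structure of the second pseudocode block. A group only exists if at least one file matched, so the empty-document case does not come up.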
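Bruno did a great job of explaining the problem and how to fix it, but since you've said that you are new to programming and you've also posted a very similar and related question, I'm going to go a little more in depth in the hope that it helps you.
First, let's write down what we know:
- There is a directory full of PDFs
- Each PDF has only one page
Then the objectives:
- Extract the text of each PDF
- Compare the extracted text against a pattern
- If there is a match, use the match as a file name and do one of the following:
  - If that file exists, add the source PDF to it
  - If that file does not exist yet, create a new file from the PDF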
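There are a couple of things you need to know before continuing. You tried to work in "append mode" by using FileMode.OpenOrCreate. That was a good guess, but it is incorrect. The PDF format has both a beginning and an end, so "start here" and "end here". When you try to append another PDF (or anything else, for that matter) to an existing file, you are just writing past the "end here" marker. At best that is junk data that gets ignored, but more likely you will end up with a corrupt PDF. The same is true of almost any file format. Two XML files concatenated together are invalid, because an XML document can only have one root element.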
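As a side note on the file mode itself: in the .NET FileStream API, FileMode.OpenOrCreate neither appends nor truncates. It opens an existing file positioned at the start, so new bytes replace old bytes in place and whatever is left of the old content stays behind. The sketch below shows this with plain text instead of a real PDF; the class name `OpenOrCreateDemo`, its method, and the sample strings are made up for the demonstration.

```csharp
using System;
using System.IO;
using System.Text;

public static class OpenOrCreateDemo
{
    // Writes newContent over an existing file through FileMode.OpenOrCreate
    // and returns what the file contains afterwards.
    public static string OverwriteWithOpenOrCreate(string path, string oldContent, string newContent)
    {
        File.WriteAllText(path, oldContent, Encoding.ASCII);
        using (var fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write))
        {
            // The stream starts at position 0 and the old bytes are not truncated.
            var bytes = Encoding.ASCII.GetBytes(newContent);
            fs.Write(bytes, 0, bytes.Length);
        }
        return File.ReadAllText(path, Encoding.ASCII);
    }

    public static void Main()
    {
        var path = Path.GetTempFileName();
        try
        {
            // The tail of the old content survives behind the new bytes.
            Console.WriteLine(OverwriteWithOpenOrCreate(path, "%PDF-old-file %%EOF", "%PDF-new"));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

This prints `%PDF-new-file %%EOF`: the first eight bytes are new and the rest is left over from the old file. FileMode.Create would truncate first and FileMode.Append would seek to the end, but neither of those produces a valid merged PDF either.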
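Second, and related, iText/iTextSharp cannot edit existing files. This is very important. What it can do is create brand new files that happen to contain exact or possibly modified versions of other files. I don't know if I can stress enough how important this is.
Third, you are using a line that gets copied around over and over again but that is very wrong and can actually corrupt your data. As for why it is bad, read this.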
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
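To see what that line can do to your text: a .NET string is already Unicode, so pushing it through a legacy encoding and back cannot add anything, and it replaces every character the legacy encoding cannot represent. The sketch below has the same shape as the line above, except that the legacy encoding is passed in explicitly. Encoding.ASCII stands in for Encoding.Default here, which is an assumption made only to get the same result on every machine; the class and method names are made up.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Same shape as the line above, with the legacy encoding passed in explicitly
    // instead of relying on the machine-dependent Encoding.Default.
    public static string RoundTrip(string text, Encoding legacy)
    {
        return Encoding.UTF8.GetString(
            Encoding.Convert(legacy, Encoding.UTF8, legacy.GetBytes(text)));
    }

    public static void Main()
    {
        // Assumption for the demo: Encoding.ASCII plays the role of a legacy default code page.
        Console.WriteLine(RoundTrip("Total: 10 €", Encoding.ASCII)); // prints "Total: 10 ?"
        Console.WriteLine(RoundTrip("ABCDEFG", Encoding.ASCII));     // prints "ABCDEFG"
    }
}
```

Text that fits the legacy encoding comes back unchanged, so the line does nothing useful, and text that does not fit comes back damaged. With a real default code page the set of damaged characters differs, but the effect is the same kind of loss.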
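Fourth, you are using RegEx, which is an overly complicated way to do a search. Maybe the code you posted was just a sample, but if it wasn't I would recommend just using currentText.Contains("") or, if you need to ignore case, currentText.IndexOf( "", StringComparison.InvariantCultureIgnoreCase ). To remove any doubt, the code below assumes that you have a more complex RegEx.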
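For a fixed search term, the two calls mentioned above could be wrapped roughly like this. It is only a sketch: the helper names and the sample text are invented, and "ABCDEFG" is borrowed from the pattern in the question.

```csharp
using System;

public static class PlainSearchDemo
{
    // Case-sensitive search for a fixed term.
    public static bool ContainsExact(string text, string term)
    {
        return text.Contains(term);
    }

    // Case-insensitive search; IndexOf returns -1 when the term is absent.
    public static bool ContainsIgnoreCase(string text, string term)
    {
        return text.IndexOf(term, StringComparison.InvariantCultureIgnoreCase) >= 0;
    }

    public static void Main()
    {
        var currentText = "This is page 3\nabcdefg";
        Console.WriteLine(ContainsExact(currentText, "ABCDEFG"));      // False
        Console.WriteLine(ContainsIgnoreCase(currentText, "ABCDEFG")); // True
    }
}
```

Neither helper tells you which text matched, so they only replace the RegEx when the output file name is known in advance.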
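With all that said, below is a full working example that should walk you through everything. Since we don't have access to your PDFs, the second section actually creates 100 sample PDFs with our search terms occasionally added to them. Your real code obviously wouldn't do this, but we need common ground to work with you on. The third section is the search-and-merge feature that you are trying to build. Hopefully the comments in the code explain everything.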
/**
 * Step 1 - Variable Setup
 */

//Namespaces used by this sample: System, System.IO, System.Text.RegularExpressions,
//iTextSharp.text, iTextSharp.text.pdf and iTextSharp.text.pdf.parser

//This is the folder that we'll be basing all other directory paths on
var workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);

//This folder will hold our PDFs with text that we're searching for
var folderPathContainingPdfsToSearch = Path.Combine(workingFolder, "Pdfs");

//This folder will hold the combined PDFs, one per matched search term
var folderPathContainingPdfsCombined = Path.Combine(workingFolder, "Pdfs Combined");

//Create our directories if they don't already exist
System.IO.Directory.CreateDirectory(folderPathContainingPdfsToSearch);
System.IO.Directory.CreateDirectory(folderPathContainingPdfsCombined);

var searchText1 = "ABC";
var searchText2 = "DEF";
/**
 * Step 2 - Create sample PDFs
 */

//Create 100 sample PDFs
for (var i = 0; i < 100; i++) {
    using (var fs = new FileStream(Path.Combine(folderPathContainingPdfsToSearch, i.ToString() + ".pdf"), FileMode.Create, FileAccess.Write, FileShare.None)) {
        using (var doc = new Document()) {
            using (var writer = PdfWriter.GetInstance(doc, fs)) {
                doc.Open();

                //Add a title so we know what page we're on when we combine
                doc.Add(new Paragraph(String.Format("This is page {0}", i)));

                //Add various strings every once in a while.
                //(Yes, I know this isn't evenly distributed but I haven't
                // had enough coffee yet.)
                if (i % 10 == 3) {
                    doc.Add(new Paragraph(searchText1));
                } else if (i % 10 == 6) {
                    doc.Add(new Paragraph(searchText2));
                } else if (i % 10 == 9) {
                    doc.Add(new Paragraph(searchText1 + searchText2));
                } else {
                    doc.Add(new Paragraph("Blah blah blah"));
                }

                doc.Close();
            }
        }
    }
}
/**
 * Step 3 - Search and merge
 */

//We'll search for two different strings just to add some spice
var reg = new Regex("(" + searchText1 + "|" + searchText2 + ")");

//Loop through each file in the directory
foreach (var filePath in Directory.EnumerateFiles(folderPathContainingPdfsToSearch, "*.pdf")) {
    using (var pdfReader = new PdfReader(filePath)) {
        for (var page = 1; page <= pdfReader.NumberOfPages; page++) {

            //Get the text from the page
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new SimpleTextExtractionStrategy());

            //DO NOT DO THIS EVER!! See this for why
            //currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

            //Match our pattern against the extracted text
            var matches = reg.Matches(currentText);

            //Bail early if we can
            if (matches.Count == 0) {
                continue;
            }

            //Loop through each match
            foreach (Match m in matches) {

                //This is the file path that we want to target
                var destFile = Path.Combine(folderPathContainingPdfsCombined, m.Value + ".pdf");

                //If the file doesn't already exist then just copy the file and move on
                if (!File.Exists(destFile)) {
                    System.IO.File.Copy(filePath, destFile);
                    continue;
                }

                //The file exists so we're going to "append" the page
                //However, writing to the end of file in Append mode doesn't work,
                //that would be like "add a file to a zip" by concatenating
                //two files. In this case, we're actually creating a brand new file
                //that "happens" to contain the original file and the matched file.
                //Instead of writing to disk for this new file we're going to keep it
                //in memory, delete the original file and write our new file
                //back onto the old file
                using (var ms = new MemoryStream()) {

                    //Use a wrapper helper provided by iText
                    var cc = new PdfConcatenate(ms);

                    //Open for writing
                    cc.Open();

                    //Import the existing file
                    using (var subReader = new PdfReader(destFile)) {
                        cc.AddPages(subReader);
                    }

                    //Import the matched file
                    //The OP stated a guarantee of only 1 page so we don't
                    //have to mess around with specifying which page to import.
                    //Also, PdfConcatenate closes the supplied PdfReader, so
                    //don't pass in the outer pdfReader variable (the loop still
                    //needs it); open a fresh reader instead.
                    using (var subReader = new PdfReader(filePath)) {
                        cc.AddPages(subReader);
                    }

                    //Close for writing
                    cc.Close();

                    //Erase our existing file
                    File.Delete(destFile);

                    //Write our new file
                    File.WriteAllBytes(destFile, ms.ToArray());
                }
            }
        }
    }
}