iText GetTextFromPage returns the text from the beginning for every page
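I have this simple piece of code. The problem is strange: on each iteration, the reader returns the whole text since the beginning of the PDF document.
It is probably something simple, but I can't see it.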
...
PdfReader reader = new PdfReader(path);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
...
public void Read(int start, int end)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    StringBuilder sb = new StringBuilder();
    for (int page = start; page < end; page++)
    {
        try
        {
            sb.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }
        catch (Exception ex)
        {
            throw new PdfException(ex.Message, ex.InnerException);
        }
        var p = new Page { Number = page, Content = sb.ToString() };
        sb.Clear();
        PageParsed?.Invoke(this, new PdfEventArgs<Page>(p));
    }
    FileParsed?.Invoke(this, new PdfEventArgs<string>(string.IsNullOrEmpty(Name) ? "File parsed" : Name));
}
The strategy object keeps state, so you have to move its instantiation inside the loop, like this:
StringBuilder sb = new StringBuilder();
for (int page = start; page < end; page++)
{
    // Create a fresh strategy for each page so that text collected
    // from earlier pages is not carried over into this one.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    try
    {
        sb.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
    }
    catch (Exception ex)
    {
        throw new PdfException(ex.Message, ex.InnerException);
    }
    var p = new Page { Number = page, Content = sb.ToString() };
    sb.Clear();
    PageParsed?.Invoke(this, new PdfEventArgs<Page>(p));
}
This will solve your problem.
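
To see why reusing the strategy produces cumulative text, here is a minimal sketch that does not use iText at all. It assumes the strategy behaves as described above, appending everything it is shown to one internal buffer and never resetting it. AccumulatingStrategy, ExtractionDemo and their methods are hypothetical names made up for this illustration, not part of the iText API.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a stateful extraction strategy: every chunk it
// receives is appended to a single buffer that is never cleared.
public sealed class AccumulatingStrategy
{
    private readonly StringBuilder _result = new StringBuilder();

    public void RenderText(string chunk) => _result.Append(chunk);

    public string GetResultantText() => _result.ToString();
}

public static class ExtractionDemo
{
    // Hypothetical stand-in for GetTextFromPage: feeds one page to the
    // strategy and returns everything the strategy has collected so far.
    public static string GetTextFromPage(
        IReadOnlyList<string> pages, int pageNumber, AccumulatingStrategy strategy)
    {
        strategy.RenderText(pages[pageNumber - 1]); // page numbers are 1-based
        return strategy.GetResultantText();
    }

    // One strategy shared by all pages, as in the question.
    public static List<string> ReadWithSharedStrategy(IReadOnlyList<string> pages)
    {
        var strategy = new AccumulatingStrategy();
        var results = new List<string>();
        for (int page = 1; page <= pages.Count; page++)
        {
            results.Add(GetTextFromPage(pages, page, strategy));
        }
        return results;
    }

    // A new strategy for each page, as in the answer.
    public static List<string> ReadWithFreshStrategy(IReadOnlyList<string> pages)
    {
        var results = new List<string>();
        for (int page = 1; page <= pages.Count; page++)
        {
            var strategy = new AccumulatingStrategy();
            results.Add(GetTextFromPage(pages, page, strategy));
        }
        return results;
    }
}
```

With the three pages "A", "B" and "C", ReadWithSharedStrategy yields "A", "AB", "ABC", which is the symptom from the question, while ReadWithFreshStrategy yields "A", "B", "C".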