PdfTextExtractor.GetTextFromPage() returns empty string

Question

I am trying to extract text from the following PDF with the code below (using iText7 7.2.2):

var source = (string)GetHttpResult("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf", new CookieContainer());
var bytes = Encoding.UTF8.GetBytes(source);
var stream = new MemoryStream(bytes);
var reader = new PdfReader(stream);
var doc = new PdfDocument(reader);
var pages = doc.GetNumberOfPages();
var text = PdfTextExtractor.GetTextFromPage(doc.GetPage(1));

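Loading the PDF in my browser (Edge 100.0) works fine.

GetHttpResult() is a simple HttpClient wrapper that sets a custom CookieContainer and a custom UserAgent, and calls ReadAsStringAsync(). Nothing fancy.

source has the correct PDF content, starting with "%PDF-1.7".

pages has the correct page count, which is 2.

However, whatever I try, text is always empty.

Defining an explicit TextExtractionStrategy, trying different encodings, extracting from all pages in a loop, ...: none of it matters. text is always empty and no exception is thrown.

I think I am not reading this PDF the way it is "meant" to be read, but then what is the right way to read it (the content in source is correct, the page count is correct, and no exception is thrown anywhere)?

Thanks.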

Answer 1

That was it! Thanks to mkl and KJ!

I first downloaded the PDF as a byte array, so I am sure it is not modified in any way.
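To illustrate why the byte array matters, here is a minimal sketch of what can happen when binary data is read as a string and converted back with Encoding.UTF8.GetBytes(), as in the question's code. The sample bytes and the helper name RoundTripThroughUtf8 are made up for this illustration and are not taken from the actual PDF; whether this was the only problem with this particular file is an assumption.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper: decode bytes as UTF-8 text, then encode the text back to bytes.
static byte[] RoundTripThroughUtf8(byte[] data) =>
    Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(data));

// Made-up sample: "%PDF-" followed by two bytes that are not valid UTF-8,
// similar to what appears inside compressed PDF streams.
byte[] sample = { 0x25, 0x50, 0x44, 0x46, 0x2D, 0x9C, 0xFF };

byte[] damaged = RoundTripThroughUtf8(sample);

// Invalid byte sequences are replaced with U+FFFD while decoding,
// so the re-encoded bytes no longer match the original.
Console.WriteLine($"identical: {sample.SequenceEqual(damaged)}"); // identical: False
```

The ASCII header survives the round trip, which would be consistent with source still starting with "%PDF-1.7" even if the binary parts of the file were altered.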

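Then, since pdftotext is able to extract the text from this PDF, I searched for a NuGet package that could do the same. I tested almost ten of them, and FreeSpire.PDF finally did it!

Update: Actually, FreeSpire.PDF missed some words, so I kept looking and finally found PdfPig, which is able to extract every word.

Code using PdfPig: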

using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

byte[] bytes;
using (HttpClient client = new())
{
    bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}

List<string> words = new();
using (PdfDocument document = PdfDocument.Open(bytes))
{
    foreach (Page page in document.GetPages())
    {
        foreach (Word word in page.GetWords())
        {
            words.Add(word.Text);
        }
    }
}

string text = string.Join(" ", words);

Code using FreeSpire.PDF:

using Spire.Pdf;
using Spire.Pdf.Exporting.Text;

byte[] bytes;
using (HttpClient client = new())
{
    bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}

string text = string.Empty;
SimpleTextExtractionStrategy strategy = new();
using (PdfDocument doc = new())
{
    doc.LoadFromBytes(bytes);
    foreach (PdfPageBase page in doc.Pages)
    {
        text += page.ExtractText(strategy);
    }
}

c#

itext7