PDF: Can a text chunk contain 2 or more words?

Question

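I am using LocationTextExtractionStrategy to render text from a PDF. The text is rendered in a function called RenderText. So my question is: can one chunk contain more than one word? For example, given the text 'MKL is a helpfull person', could it be split into chunks like this (the most important chunk is in bold): MK | L | is a h | elpfull | per son?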

Below is the code I use to split words. I separate the words while adding text (the chunks coming from the RenderText function) to the current line.

using System;
using System.Collections.Generic;

public class TextLineLocation
{
    public float X { get; set; }
    public float Y { get; set; }
    public float Height { get; set; }
    public float Width { get; set; }
    private string Text;
    // Characters that separate words.
    private static readonly char[] bannedSigns = { ' ', ',', '.', '/', '|', '\\', ';', '(', ')', '*', '&', '^', '!', '?' };
    public List<WordLocation> wordList = new List<WordLocation>();
    // True when the next piece of text must start a new word.
    private bool startNewWord = true;

    public void AddText(TextInfo text)
    {
        string chunkText = text.textChunk.Text;
        Text += chunkText;
        // Split on all separators at once, so a chunk may hold any number of words.
        string[] parts = chunkText.Split(bannedSigns);
        for (int i = 0; i < parts.Length; i++)
        {
            // Every part after the first is preceded by a separator.
            if (i > 0)
            {
                startNewWord = true;
            }
            // Empty parts come from leading, trailing, or adjacent separators.
            if (parts[i].Length == 0)
            {
                continue;
            }
            if (startNewWord)
            {
                // Position and size are those of the whole chunk, so they are
                // only approximate for a word in the middle of it.
                wordList.Add(new WordLocation(text.textChunk.StartLocation[1], text.textChunk.StartLocation[0], text.getFontWidth(), text.getFontHeight(), parts[i]));
                startNewWord = false;
            }
            else
            {
                // The text continues the word begun by an earlier chunk.
                wordList[wordList.Count - 1].Text += parts[i];
                wordList[wordList.Count - 1].Width += text.getFontWidth();
                // A word is as tall as its tallest chunk.
                wordList[wordList.Count - 1].Height = Math.Max(wordList[wordList.Count - 1].Height, text.getFontHeight());
            }
        }
    }
}

Answer 1

I'm not sure which library LocationTextExtractionStrategy comes from, or what exactly it does, but in the PDF representation itself you can group characters together into a "chunk".

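How this is used is entirely up to the program that generated the PDF: some programs keep whole words together, some only group fragments of words (for kerning, for example), and some do other random things.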
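Since the grouping is arbitrary, one way to cope is to ignore chunk boundaries entirely and carry the word in progress from one chunk to the next. The following is only a minimal sketch: ChunkWords and its separator list are hypothetical names for illustration and not part of any PDF library, and it assumes the chunks arrive in reading order and contain real space characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not a library API): splits a stream of text chunks
// into words without assuming where the chunk boundaries fall.
public static class ChunkWords
{
    // Assumed separator set; extend it as needed.
    private static readonly char[] Separators = { ' ', ',', '.', ';', '!', '?' };

    public static List<string> Split(IEnumerable<string> chunks)
    {
        var words = new List<string>();
        var current = new StringBuilder();   // word in progress, may span chunks

        foreach (string chunk in chunks)
        {
            foreach (char c in chunk)
            {
                if (Array.IndexOf(Separators, c) >= 0)
                {
                    // A separator ends the word in progress, if there is one.
                    if (current.Length > 0)
                    {
                        words.Add(current.ToString());
                        current.Clear();
                    }
                }
                else
                {
                    current.Append(c);
                }
            }
        }

        // Flush the last word once all chunks are consumed.
        if (current.Length > 0)
        {
            words.Add(current.ToString());
        }
        return words;
    }
}

public static class ChunkWordsDemo
{
    public static void Main()
    {
        // Two ways a producer might chunk the same line of text.
        var fragmented = ChunkWords.Split(new[] { "MK", "L is a h", "elpfull per", "son" });
        var whole = ChunkWords.Split(new[] { "MKL is a helpfull person" });

        Console.WriteLine(string.Join("|", fragmented)); // MKL|is|a|helpfull|person
        Console.WriteLine(string.Join("|", whole));      // MKL|is|a|helpfull|person
    }
}
```

Both chunkings yield the same five words, because the state lives in the splitter and not in the chunk. This breaks down as soon as the PDF draws word gaps purely by position, with no space characters at all.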

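So if LocationTextExtractionStrategy returns these groups as its chunks, you can't rely on anything. If LocationTextExtractionStrategy doesn't return them, and instead relies on a spacing heuristic to group characters into chunks, then the result will be exactly as good as that heuristic.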
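To make "spacing heuristic" concrete, here is a rough sketch of one. Everything in it is an assumption for illustration: the Glyph type, the GapHeuristic class, and the 0.3 gap factor are made up here and are not taken from any library, and real extractors are considerably more careful.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical glyph record: a character, its left edge, and its advance width.
public sealed class Glyph
{
    public char Char { get; }
    public float X { get; }
    public float Width { get; }

    public Glyph(char c, float x, float width)
    {
        Char = c;
        X = x;
        Width = width;
    }
}

public static class GapHeuristic
{
    // Groups the glyphs of one line into words. A horizontal gap wider than
    // gapFactor times the previous glyph's width counts as a word break.
    // The default of 0.3 is an arbitrary assumption, not a known-good value.
    public static List<string> Words(IEnumerable<Glyph> glyphs, float gapFactor = 0.3f)
    {
        var sorted = new List<Glyph>(glyphs);
        sorted.Sort((a, b) => a.X.CompareTo(b.X));   // left-to-right reading order

        var words = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < sorted.Count; i++)
        {
            if (i > 0)
            {
                Glyph prev = sorted[i - 1];
                float gap = sorted[i].X - (prev.X + prev.Width);
                if (gap > gapFactor * prev.Width && current.Length > 0)
                {
                    words.Add(current.ToString());
                    current.Clear();
                }
            }
            current.Append(sorted[i].Char);
        }

        if (current.Length > 0)
        {
            words.Add(current.ToString());
        }
        return words;
    }
}

public static class GapHeuristicDemo
{
    public static void Main()
    {
        // "is a" drawn with no space character; only the positions show the gap.
        var glyphs = new[]
        {
            new Glyph('i', 0f, 3f),
            new Glyph('s', 3f, 5f),
            new Glyph('a', 11f, 5f),
        };
        Console.WriteLine(string.Join("|", GapHeuristic.Words(glyphs))); // is|a
    }
}
```

Tight letter spacing, justified text, or an unusual font can all push a gap to the wrong side of the threshold, which is why the output is only as good as the heuristic.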

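Bottom line: a PDF doesn't contain text, it contains glyphs and their positions on the page. Trying to reconstruct text from that is, and will remain, guesswork. You will probably get away with it in most cases, but whatever you do, there will always be PDFs on which it fails.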

pdf

itext

chunks