How to avoid bad values in TextChunk when using LocationTextExtractionStrategy from iTextSharp?

Question

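I have been using the iTextSharp library for years, with an extension of LocationTextExtractionStrategy, to extract text from PDF files. It gives me all the words and their positions.

But now, in a new PDF (generated with iText 1.4.3), I get the following chunks for characters that belong to the same line, as you can see in the example image.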

Text: S startLocation x:122 y:110.64 z:1 endLocation  x:126.8 y:125.04 z:1
Text: e startLocation x:126.8 y:110.64 z:1 endLocation  x:131.6 y:125.04 z:1
Text: x startLocation x:131.6 y:110.64 z:1 endLocation  x:136.4 y:125.04 z:1
Text: L startLocation x:122 y:135.3 z:1 endLocation  x:126.8 y:226.5 z:1
Text: a startLocation x:126.8 y:135.3 z:1 endLocation  x:131.6 y:226.5 z:1
Text: s startLocation x:131.6 y:135.3 z:1 endLocation  x:136.4 y:226.5 z:1
Text: t startLocation x:136.4 y:135.3 z:1 endLocation  x:141.2 y:226.5 z:1
Text: n startLocation x:141.2 y:135.3 z:1 endLocation  x:146 y:226.5 z:1
Text: a startLocation x:146 y:135.3 z:1 endLocation  x:150.8 y:226.5 z:1
Text: m startLocation x:150.8 y:135.3 z:1 endLocation  x:155.6 y:226.5 z:1
Text: e startLocation x:155.6 y:135.3 z:1 endLocation  x:160.4 y:226.5 z:1

Before generating the TextChunk, it gives me:

S|distParallelStart 143.5421|distParallelEnd 158.7211| distPerpendicular 81 | orientationMagnitude 1249|orientationVector 0,3162279,  0,9486833, 0
e|distParallelStart 145.06  |distParallelEnd 160.239 | distPerpendicular 85 | orientationMagnitude 1249|orientationVector 0,3162279,  0,9486833, 0
x|distParallelStart 146.5779|distParallelEnd 161.7569| distPerpendicular 90 | orientationMagnitude 1249|orientationVector 0,3162279,  0,9486833, 0
L|distParallelStart 141.5252|distParallelEnd 232.8514| distPerpendicular 115| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
a|distParallelStart 141.7775|distParallelEnd 233.1037| distPerpendicular 120| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
s|distParallelStart 142.0297|distParallelEnd 233.356 | distPerpendicular 124| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
t|distParallelStart 142.282 |distParallelEnd 233.6083| distPerpendicular 129| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
n|distParallelStart 142.5343|distParallelEnd 233.8605| distPerpendicular 134| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
a|distParallelStart 142.7866|distParallelEnd 234.1128| distPerpendicular 139| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
m|distParallelStart 143.0389|distParallelEnd 234.3651| distPerpendicular 143| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0
e|distParallelStart 143.2912|distParallelEnd 234.6174| distPerpendicular 148| orientationMagnitude 1518|orientationVector 0,05255886, 0,9986178, 0

The code that checks whether two chunks are on the same line returns false (because distPerpendicular differs):

 virtual public bool SameLine(TextChunk a){
   if (orientationMagnitude != a.orientationMagnitude) return false;
   if (distPerpendicular != a.distPerpendicular) return false;
   return true;
 }

distPerpendicular is computed in the TextChunk class:

public TextChunk(String str, Vector startLocation, Vector endLocation, float charSpaceWidth) {
    this.text = str;
    this.startLocation = startLocation;
    this.endLocation = endLocation;
    this.charSpaceWidth = charSpaceWidth;

    Vector oVector = endLocation.Subtract(startLocation);
    if (oVector.Length == 0) {
        oVector = new Vector(1, 0, 0);
    }
    orientationVector = oVector.Normalize();
    orientationMagnitude = (int)(Math.Atan2(orientationVector[Vector.I2], orientationVector[Vector.I1])*1000);

    // see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
    // the two vectors we are crossing are in the same plane, so the result will be purely
    // in the z-axis (out of plane) direction, so we just take the I3 component of the result
    Vector origin = new Vector(0,0,1);
    distPerpendicular = (int)(startLocation.Subtract(origin)).Cross(orientationVector)[Vector.I3];

    distParallelStart = orientationVector.Dot(startLocation);
    distParallelEnd = orientationVector.Dot(endLocation);
}
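To make the effect of a slanted orientation vector easier to see, here is a minimal sketch that repeats the constructor's arithmetic on plain numbers, without referencing iTextSharp. The class name `ChunkMath` and the method `Compute` are hypothetical names introduced only for this illustration, and the sketch uses `double` instead of the library's `float`, so the last digit of a result may differ from the values printed above.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp: it repeats the 2D arithmetic
// of the TextChunk constructor shown above so the numbers can be checked by hand.
public static class ChunkMath
{
    public static (int OrientationMagnitude, int DistPerpendicular) Compute(
        double startX, double startY, double endX, double endY)
    {
        // Orientation vector = end - start, normalized (falls back to (1,0) for zero length).
        double dx = endX - startX;
        double dy = endY - startY;
        double length = Math.Sqrt(dx * dx + dy * dy);
        if (length == 0)
        {
            dx = 1;
            dy = 0;
            length = 1;
        }
        double ox = dx / length;
        double oy = dy / length;

        // Angle of the orientation vector in milliradians, truncated to int.
        int orientationMagnitude = (int)(Math.Atan2(oy, ox) * 1000);

        // z component of (start - origin) x orientation. The origin is (0,0,1),
        // so only the x and y components contribute: x * oy - y * ox.
        int distPerpendicular = (int)(startX * oy - startY * ox);

        return (orientationMagnitude, distPerpendicular);
    }
}
```

With the posted start and end points for `S` and `e`, `Compute` returns the same orientation magnitude (1249) but different perpendicular distances, which is why `SameLine` returns false. If both points of each chunk share the same y (for example 110.64), the orientation magnitude is 0 and the perpendicular distance is identical for neighbouring characters.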

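If I call locationalResult.Sort(), these chunks get mixed up with other chunks in the document, because the data appears to have no order. In the other PDFs, which work, the orientationVector is (1,0,0). The difference is that here startLocation and endLocation do not have the same y component. It seems to be related to the height. Can someone explain what is wrong, and how to correct the values so that all characters on the same line are recognized as such?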

Examplepage

Answer 1

The document is in landscape orientation: the chunks share the same X component while Y varies. I only had to swap the X and Y coordinates to make it work:

    Function GetCharacterRenderInfos() As List(Of CustomTextRenderInfo)
        Dim baseList As IList(Of TextRenderInfo) = Me.BaseInfo.GetCharacterRenderInfos()
        Dim caracteres() As Char = Me.GetText().ToCharArray()
        Dim vStart As Vector = Me.BaseLine.GetStartPoint()
        Dim vEnd As Vector = Me.BaseLine.GetEndPoint()

        Dim x As Single = vStart(Vector.I1)
        Dim y As Single = vStart(Vector.I2)
        Dim z As Single = vStart(Vector.I3)

        Dim y2 As Single = vEnd(Vector.I2)
        If (x.Equals(vEnd(Vector.I1))) Then 'This case: the baseline is vertical
            x = vStart(Vector.I2)
            y = 2000 - vStart(Vector.I1) 'Because the rightmost column must be on top
            y2 = 2000 - vEnd(Vector.I1)
        End If

        If x < 0 And y > 0 Then
            x = 0
        End If

There may be other solutions, but this one works for me. Thanks again.
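For readers working in C#, the coordinate swap from the snippet above could be sketched roughly as follows. This is only an illustration on plain floats: `LandscapeRemap`, `Remap` and `FlipOffset` are hypothetical names, the constant 2000 is taken from the snippet and is assumed to be larger than any X coordinate on the page, and mapping the end point's Y to the new end X is an assumption, since the snippet does not show that part.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp: remaps a vertical baseline
// (as found on the landscape page) to a horizontal one.
public static class LandscapeRemap
{
    // Taken from the snippet above; assumed to exceed every X coordinate on the page.
    public const float FlipOffset = 2000f;

    public static (float StartX, float StartY, float EndX, float EndY) Remap(
        float startX, float startY, float endX, float endY)
    {
        // A baseline whose start and end share the same X runs vertically on the page.
        if (startX.Equals(endX))
        {
            // The old Y becomes the new X. The old X is flipped so that the
            // rightmost column ends up on top. Using endY as the new end X
            // is an assumption, as the original snippet stops before that point.
            return (startY, FlipOffset - startX, endY, FlipOffset - endX);
        }

        // Any other baseline is left unchanged.
        return (startX, startY, endX, endY);
    }
}
```

After the remap, the start and end points of a formerly vertical baseline share the same Y, so a chunk built from them gets a horizontal orientation vector, and characters from the same column receive the same perpendicular distance.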

c#

pdf

itextsharp