How do Text Objects in PDF work?
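I have a PDF document from which I want to remove the watermark as automatically as possible, so that I get better results from pdftotext.
After uncompressing it with pdftk, I saw that the watermark is almost plain text: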
BT
1 0 0 1 277.40012 755.2005 Tm
0.501961 0.501961 0.501961 rg /R1 gs /R2 8 Tf
[()]TJ
0 0 Td
[(Abc)30(defghi K)30(lm)-40(no)]TJ
-5.423981 -9.600038 Td
[()]TJ
0 0 Td
[(Apr 01, 2017 12:34)]TJ
ET
The watermark is
Abcdefghi Klmno
Apr 01, 2017 12:34
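As a quick check that the content stream above really carries this watermark, here is a minimal Python sketch that concatenates the string elements of a TJ array and ignores the numbers. The helper name `tj_text` is made up for illustration, and the regular expression assumes simple literal strings only (no hex strings, no unescaped nested parentheses, no octal or `\n` escapes).

```python
import re

# One literal string "(...)" inside a TJ array, allowing backslash escapes.
# Assumption: no unescaped nested parentheses and no hex strings <...>.
STRING_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)")


def tj_text(tj_array: str) -> str:
    """Concatenate the string elements of a TJ array, skipping the numbers."""
    parts = STRING_RE.findall(tj_array)
    # Turn an escaped character such as \( or \) back into the character itself.
    # Octal escapes and \n, \r, \t are NOT interpreted in this sketch.
    return "".join(re.sub(r"\\(.)", r"\1", part) for part in parts)


lines = [
    "[(Abc)30(defghi K)30(lm)-40(no)]TJ",
    "[(Apr 01, 2017 12:34)]TJ",
]
for line in lines:
    print(tj_text(line))
```

The two printed lines match the watermark text, which suggests the numbers inside the array only adjust positions and carry no text of their own.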
After skimming through Document management — Portable document format (especially page 248f), I found the following:
BT: Begin Text
Tm: Text matrix - what is that?
x y Td: Move to the start of the next line with an offset of (x, y)
TJ: Text showing
Tf: Text state
ET: End Text
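What I don't understand is all the numbers and why
[(Abc)30(defghi K)30(lm)-40(no)]TJ
Does it increase the space between Abc and defghi K and decrease the space between lm and no (seems so, looking at Figure 46 on page 259)? By what unit?
What does Tf do?
Can somebody explain this?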
Partial answer
Tf
font size Tf
Sets the font and the font size (see page 244).
gs
dictName gs
Sets the graphics state:
(PDF 1.2) Set the specified parameters in the graphics state. dictName shall be the name of a graphics state parameter dictionary in the ExtGState subdictionary of the current resource dictionary (see the next sub-clause).
I'm not quite sure what /R1 refers to.
rg
1.0 1.0 0.0 rg % Set nonstroking colour to yellow
So 0.501961 0.501961 0.501961 rg sets the nonstroking colour to some grey value.
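Text matrix
The text matrix is an affine transformation matrix, as explained in this answer.
Hence
1 0 0 1 0 0 Tm
changes nothing.
The matrix 1 0 0 1 277.40012 755.2005 Tm
moves the text 277.40012 units to the right and 755.2005 units up (in the default PDF coordinate system the y axis points upward).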
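To make the effect of the six Tm operands concrete, here is a small Python sketch that applies a PDF matrix `a b c d e f` to a point. The function name `apply_matrix` is hypothetical, and the sketch only covers the matrix itself: it ignores font size, horizontal scaling, text rise and the current transformation matrix, all of which also take part in the full mapping from text space to device space.

```python
def apply_matrix(m, x, y):
    """Apply a PDF matrix (a, b, c, d, e, f) to the point (x, y).

    PDF treats points as row vectors:
    [x' y' 1] = [x y 1] * [[a b 0], [c d 0], [e f 1]]
    """
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


identity = (1, 0, 0, 1, 0, 0)                     # 1 0 0 1 0 0 Tm
watermark_tm = (1, 0, 0, 1, 277.40012, 755.2005)  # Tm from the watermark

print(apply_matrix(identity, 0, 0))       # the origin stays where it is
print(apply_matrix(watermark_tm, 0, 0))   # the origin moves to (e, f)
```

With `a = d = 1` and `b = c = 0` there is no scaling or rotation, so the matrix is a pure translation by `(e, f)`.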
What I don't understand is all the numbers and why
[(Abc)30(defghi K)30(lm)-40(no)]TJ
Does it increase the space between Abc and defghi K and decrease the space between lm and no (seems so, looking at Figure 46 on page 259)?
Almost: positive values decrease the space and negative values increase it, cf. Table 109 – Text-showing operators in the PDF specification:
array TJ : Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount. Figure 46 shows an example of the effect of passing offsets to TJ.
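This figure is misleading; apparently some typesetting program garbled the effect the author wanted to show. The actual source of the figure looks like this: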
BT
/T1_2 1 Tf
0 Tc 8.7503 0 0 8.7503 118.989 450.2115 Tm
[([ \()11(A)53(W)57(A)79(Y again\) ] )41(T)43(J)]TJ
40.0016 0 0 40.0015 296.9949 440.2111 Tm
[(A)53(W)57(A)79(Y again)]TJ
8.7503 0 0 8.7503 118.989 403.2097 Tm
[([ \()11(A)9(\) 120 \()-50(W)-55(\) 120 \()11(A)9(\) 95 \()-41(Y again\) ] )41(T)43(J)]TJ
40.0016 0 0 40.0015 296.9949 392.2093 Tm
(AWAY again)Tj
ET
By what unit?
Thousandths of a unit of text space, cf. the quote above.
Text space is the coordinate system in which text is shown. It shall be defined by the text matrix, Tm, and the text state parameters Tfs, Th, and Trise, which together shall determine the transformation from text space to user space.
A thousandth of a text space unit usually coincides with a single unit in glyph space.
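As a rough worked example, the following Python sketch turns a TJ number into a horizontal shift. It assumes the displacement formula from section 9.4.4 of the specification reduced to the case of a bare number in horizontal writing mode, i.e. shift = -number / 1000 × font size × horizontal scaling; the function name `tj_shift` is made up for illustration.

```python
def tj_shift(adjustment, font_size, horizontal_scaling=1.0):
    """Horizontal shift caused by one number in a TJ array.

    Assumes horizontal writing mode. The number is subtracted, so a
    positive adjustment moves the next glyph to the left.
    """
    return -adjustment / 1000.0 * font_size * horizontal_scaling


# The watermark uses /R2 8 Tf, i.e. a font size of 8.
for n in (30, -40):
    print(n, round(tj_shift(n, font_size=8), 4))
```

For the watermark this gives a shift of about 0.24 units to the left for each 30 and about 0.32 units to the right for the -40. With the Tm `1 0 0 1 277.40012 755.2005` and an unmodified user space, those are default user space units of 1/72 inch, so the adjustments are tiny kerning corrections rather than visible gaps.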
What does Tf do?
According to Table 105 – Text state operators in the PDF specification:
font size Tf :
Set the text font, Tf, to font and the text font size, Tfs, to size. font shall be the name of a font resource in the Font subdictionary of the current resource dictionary; size shall be a number representing a scale factor. There is no initial value for either font or size; they shall be specified explicitly by using Tf before any text is shown.
The only thing I don't understand now is the line
0.501961 0.501961 0.501961 rg /R1 gs /R2 8 Tf
Can you explain that, too?
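The instruction
0.501961 0.501961 0.501961 rg
sets the fill colour to a medium grey in the RGB colour space.
Then
/R1 gs
sets additional graphics state parameters from the ExtGState resource named R1; probably some transparency effect is defined there.
Finally
/R2 8 Tf
sets the font to the one defined by the Font resource named R2 and the font size to 8.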
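The way this line breaks down into three instructions follows from PDF's postfix notation: operands come first and the operator comes last. The Python sketch below groups the tokens accordingly. The helper `split_operations` is hypothetical and only works for lines like this one, where no strings or arrays occur and splitting on whitespace is enough.

```python
def split_operations(line, operators=("rg", "gs", "Tf")):
    """Group whitespace-separated tokens into (operator, operands) pairs.

    Assumption: the line contains no strings or arrays, and every
    operator that occurs is listed in `operators`.
    """
    ops, operands = [], []
    for token in line.split():
        if token in operators:
            ops.append((token, operands))
            operands = []
        else:
            operands.append(token)
    return ops


line = "0.501961 0.501961 0.501961 rg /R1 gs /R2 8 Tf"
for op, args in split_operations(line):
    print(op, args)

# rg components lie in the range 0..1; scaled to 8 bits each one is 128.
print(round(0.501961 * 255))
```

All three colour components being equal, at 128 out of 255 each, is what makes the fill colour a medium grey.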