Get only the third and sixth columns from the command output of pdffonts

I am using poppler's pdffonts to list the fonts in a PDF document. Below is sample output:

$ pdffonts "some.pdf"
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRoman                        TrueType          WinAnsi          no  no  no      36  0
TimesNewRoman,Bold                   TrueType          WinAnsi          no  no  no      38  0
EDMFMD+Symbol                        CID TrueType      Identity-H       yes yes yes     41  0
Arial                                TrueType          WinAnsi          no  no  no      43  0
Arial,Bold                           TrueType          WinAnsi          no  no  no      16  0

Now I want to get only the "encoding" and "uni" column values from the output above, but I can't, because the spacing is inconsistent from row to row.

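Approaches tried (Python):

1) Split each line on whitespace, join with single spaces and split again, so that the elements at index 2 and 5 of the resulting list give the required values for each row. This fails because some column values themselves contain spaces.

Code sample:

import os

for line in os.popen("pdffonts some.pdf").readlines():
    print(' '.join(line.split()).split())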

Output:

['name', 'type', 'encoding', 'emb', 'sub', 'uni', 'object', 'ID']
['------------------------------------', '-----------------', '----------------', '---', '---', '---', '---------']
['FMGLMO+MyriadPro-Bold', 'Type', '1C', 'Custom', 'yes', 'yes', 'yes', '127', '0']
['FMGMMM+MyriadPro-Semibold', 'Type', '1C', 'Custom', 'yes', 'yes', 'yes', '88', '0']
['Arial-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '90', '0']
['TimesNewRomanPSMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '92', '0']
['FMGMHL+TimesNewRomanPSMT', 'CID', 'TrueType', 'Identity-H', 'yes', 'yes', 'no', '95', '0']
['FMHBEE+Arial-BoldMT', 'CID', 'TrueType', 'Identity-H', 'yes', 'yes', 'no', '100', '0']
['TimesNewRomanPS-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '103', '0']

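2) Split each line of the output with a regex on runs of at least two spaces. This fails because index 5 is no longer reachable: in some rows the emb, sub and uni values are separated by only a single space, so they stay glued together.

Code sample:

import os
import re

for line in os.popen("pdffonts some.pdf").readlines():
    print(re.split(r'\s{2,}', line.strip()))

Output: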

['name', 'type', 'encoding', 'emb sub uni object ID']
['------------------------------------ ----------------- ---------------- --- --- --- ---------']
['FMGLMO+MyriadPro-Bold', 'Type 1C', 'Custom', 'yes yes yes', '127', '0']
['FMGMMM+MyriadPro-Semibold', 'Type 1C', 'Custom', 'yes yes yes', '88', '0']
['Arial-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '90', '0']
['TimesNewRomanPSMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '92', '0']
['FMGMHL+TimesNewRomanPSMT', 'CID TrueType', 'Identity-H', 'yes yes no', '95', '0']
['FMHBEE+Arial-BoldMT', 'CID TrueType', 'Identity-H', 'yes yes no', '100', '0']
['TimesNewRomanPS-BoldMT', 'TrueType', 'WinAnsi', 'no', 'no', 'no', '103', '0']

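AWK: fails because of the same space issue. Compare with the original output to see the difference (the third row prints "TrueType" instead of "Identity-H").

$ pdffonts "some.pdf" | awk '{print $3}'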

encoding
----------------
WinAnsi
WinAnsi
TrueType
WinAnsi
WinAnsi

Using GNU awk:

awk -v FIELDWIDTHS='36 1:17 1:16 1:3 1:3 1:3 1:9' '{ print $3, $6 }' file
encoding         uni
---------------- ---
WinAnsi          no
WinAnsi          no
Identity-H       yes
WinAnsi          no
WinAnsi          no

From man gawk:

FIELDWIDTHS

A whitespace-separated list of field widths. When set, gawk parses the input into fields of fixed width, instead of using the value of the FS variable as the field separator. Each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts...
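For readers staying in Python, the same fixed-width idea can be sketched with plain string slicing. This is only an illustration: the `FIELD_LAYOUT` list and the `split_fixed` helper are names made up here, and the widths are simply copied from the FIELDWIDTHS value above, so they are assumed to match the pdffonts layout shown in the question. Unlike gawk, this sketch also strips the padding from each field.

```python
# (skip, width) pairs copied from the FIELDWIDTHS value '36 1:17 1:16 1:3 1:3 1:3 1:9'.
# These numbers are an assumption about the pdffonts layout, not something pdffonts reports.
FIELD_LAYOUT = [(0, 36), (1, 17), (1, 16), (1, 3), (1, 3), (1, 3), (1, 9)]

def split_fixed(line, layout=FIELD_LAYOUT):
    """Cut one line into stripped fields using (skip, width) pairs."""
    fields = []
    pos = 0
    for skip, width in layout:
        pos += skip                      # characters to skip before the field
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields

# Two rows taken from the sample output in the question.
sample = (
    "TimesNewRoman                        TrueType          WinAnsi          no  no  no      36  0",
    "EDMFMD+Symbol                        CID TrueType      Identity-H       yes yes yes     41  0",
)

for row in sample:
    fields = split_fixed(row)
    # Third and sixth fields are encoding and uni (indexes 2 and 5).
    print(fields[2], fields[5])
```

Hard-coding the widths keeps the code short, but it would need updating if a different pdffonts version laid out its columns differently.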

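You can use the dashes below the column names to determine where to cut each line.

In the second line we can easily find the runs of consecutive dashes and cut the columns at the start and end of each run (a run starts at " -" and ends at "- ").

I wrote a function get_column that looks up a column by its name.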

import os

lines_in = os.popen("pdffonts some.pdf")
# read the column names
header = lines_in.readline()

# read the: --------...
column_dashes = lines_in.readline()

# find column starts and ends
column_starts = [0]
pos = 0
while True:
  pos = column_dashes.find(" -", pos)
  if pos == -1:
    break
  column_starts.append(pos+1)
  pos += 1

column_ends = []
pos = 0
while True:
  pos = column_dashes.find("- ", pos)
  if pos == -1:
    column_ends.append(len(column_dashes))
    break
  column_ends.append(pos+1)
  pos += 1

def get_column( line, name ):
  n = columns[name]
  return line[column_starts[n]:column_ends[n]].strip()

# get column names
columns = {}
for n in range(len(column_starts)):
  columns[ header[column_starts[n]:column_ends[n]].strip() ] = n

# read rest of the table
for line in lines_in.readlines():
  print( (get_column(line,"encoding"), get_column(line, "uni")) )

Result:

('WinAnsi', 'no')
('WinAnsi', 'no')
('Identity-H', 'yes')
('WinAnsi', 'no')
('WinAnsi', 'no')
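The two scanning loops above can also be expressed with a regular expression, since each run of dashes is exactly one column. The following is a compact sketch of that variant, not a replacement for the answer's code; `column_spans` and `parse_table` are hypothetical helper names, and the sketch assumes every value fits inside the width of its dash run.

```python
import re

def column_spans(dash_line):
    """Return (start, end) slice bounds for every run of dashes."""
    return [m.span() for m in re.finditer(r"-+", dash_line)]

def parse_table(lines):
    """Yield one dict per data row, keyed by the header text of each column."""
    header, dashes, *rows = lines
    spans = column_spans(dashes)
    names = [header[a:b].strip() for a, b in spans]
    for row in rows:
        yield {name: row[a:b].strip() for name, (a, b) in zip(names, spans)}

# Header, dash line and two rows taken from the sample output in the question.
sample = [
    "name                                 type              encoding         emb sub uni object ID",
    "------------------------------------ ----------------- ---------------- --- --- --- ---------",
    "TimesNewRoman                        TrueType          WinAnsi          no  no  no      36  0",
    "EDMFMD+Symbol                        CID TrueType      Identity-H       yes yes yes     41  0",
]

for record in parse_table(sample):
    print((record["encoding"], record["uni"]))
```

To run it against real output, something like `parse_table(os.popen("pdffonts some.pdf").read().splitlines())` should work, given that the first two lines are the header and the dashes.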

You can collect the string position of each desired column:

with open('pdffonts.txt') as f:
    header = f.readline()
    read_data = f.read()

header_values = header.split()

positions = {}
for name in header_values:
    positions[name] = header.index(name)
print(positions)

This gives you the following dictionary for the sample:

{'name': 0, 'type': 37, 'encoding': 55, 'emb': 72, 'sub': 76, 'uni': 80, 'object': 84, 'ID': 91}

After that you can specify the substring ranges to extract:

desired_columns = []
for line in read_data.splitlines()[1:]:
    encoding = line[positions['encoding']:positions['emb']].strip()
    uni = line[positions['uni']:positions['object']].strip()
    desired_columns.append([encoding,uni])

print(desired_columns)

Result:

[['WinAnsi', 'no'], ['WinAnsi', 'no'], ['Identity-H', 'yes'], ['WinAnsi', 'no'], ['WinAnsi', 'no']]
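One caveat with `header.index(name)`: it returns the first place the text occurs, so a column name that also appears inside an earlier header word would get the wrong offset. That does not happen with this particular header, but a small sketch that takes each offset from the regex match itself avoids the question entirely. The helper names `header_positions` and `slice_between` are invented for this illustration.

```python
import re

def header_positions(header):
    """Map each whitespace-separated header word to its starting offset."""
    return {m.group(): m.start() for m in re.finditer(r"\S+", header)}

def slice_between(line, positions, start_name, end_name):
    """Return the stripped text between the starts of two header words."""
    return line[positions[start_name]:positions[end_name]].strip()

# Header and one row taken from the sample output in the question.
header = "name                                 type              encoding         emb sub uni object ID"
row = "EDMFMD+Symbol                        CID TrueType      Identity-H       yes yes yes     41  0"

positions = header_positions(header)
print(positions)
print(slice_between(row, positions, "encoding", "emb"),
      slice_between(row, positions, "uni", "object"))
```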

Similarly, with Perl you can do it as shown below (here some.pdf is a text file holding the pdffonts output):

> cat some.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
TimesNewRoman                        TrueType          WinAnsi          no  no  no      36  0
TimesNewRoman,Bold                   TrueType          WinAnsi          no  no  no      38  0
EDMFMD+Symbol                        CID TrueType      Identity-H       yes yes yes     41  0
Arial                                TrueType          WinAnsi          no  no  no      43  0
Arial,Bold                           TrueType          WinAnsi          no  no  no      16  0
> perl -lane ' $enc=@F==9? $F[3]:$F[2]; print "$enc\t\t$F[-3]" ' some.pdf
encoding                uni
----------------                ---
WinAnsi         no
WinAnsi         no
Identity-H              yes
WinAnsi         no
WinAnsi         no
>
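The Perl one-liner works because, after whitespace splitting, uni is always the third field from the end. A rough Python sketch of the same idea is given below; it indexes both values from the right, so it does not need to care how many words the type column holds. It assumes the encoding value never contains a space, that "object ID" always splits into two tokens, and that the first two lines are the header and the dashes. The function names are made up for this example.

```python
def encoding_and_uni(row):
    """Return (encoding, uni) for one data row, indexing fields from the right."""
    fields = row.split()
    # Assumed right-hand layout: ... encoding emb sub uni object-number generation
    return fields[-6], fields[-3]

def parse(lines):
    """Skip the header and dash lines, ignore blank lines, parse the rest."""
    return [encoding_and_uni(line) for line in lines[2:] if line.strip()]

# Same content as the sample output; exact column alignment does not matter here.
sample = [
    "name type encoding emb sub uni object ID",
    "---- ---- -------- --- --- --- ---------",
    "TimesNewRoman TrueType WinAnsi no no no 36 0",
    "EDMFMD+Symbol CID TrueType Identity-H yes yes yes 41 0",
]

for pair in parse(sample):
    print(pair)
```

The position-based answers above remain the safer choice if a value in the encoding column could ever contain a space.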