Camelot: table_area and table_regions do not work as expected

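For the past few days I have been trying to get Camelot to work on a specific area of a PDF page, but it keeps confusing me. I have read and tried the suggestions in the documentation and in some bug reports, to no avail. I need some help.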

I took an example from the documentation because it has multiple tables (this one). I modified the original command to extract only one of the two tables, from:

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text=' .\n')

to:

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')

Given:

Here are the results of my trials on the PDF above:

First: using table_area on the '35,591,385,343' PDF region (top table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
                                                    0                                                  1         2         3         4         5         6         7         8         9
0   Program. Represents arrests reported (not char...                                                                                                                                   
1   by the FBI. Some persons may be arrested more ...                                                                                                                                   
2   could represent multiple arrests of the same p...                                                                                                                                   
3                                                                                                            Total                          Male                        Female          
4                                     Offense charged                                                     Under 18  18 years            Under 18  18 years            Under 18  18 years
5                                                                                                  Total     years  and over     Total     years  and over     Total     years  and over
6   Total   . . .  .  .  .  .  . .  . .  . .  . . ...                                          11,062 .6  1,540 .0  9,522 .6  8,263 .3  1,071 .6  7,191 .7  2,799 .2    468 .3  2,330 .9
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...                                             467 .9     69 .1    398 .8    380 .2     56 .5    323 .7     87 .7     12 .6     75 .2
8                             Murder and nonnegligent                                                                                                                                   
9           manslaughter . . . . . . . .. .. .. .. ..                                               10.0       0.9       9.1       9.0       0.9       8.1       1.1         –       1.0
10       Forcible rape . . . . . . . .. .. .. .. .. .                                               17.5       2.6      14.9      17.2       2.5      14.7         –         –         –
11         Robbery . . . .. .. . .. . ... . ... . ...                                              102.1      25.5      76.6      90.0      22.9      67.1      12.1       2.5       9.5
....
34       Disorderly conduct . .. . . . . . .. .. .. .                                              529.5     136.1     393.3     387.1      90.8     296.2     142.4      45.3      97.1
35          Vagrancy . . . .. . . . ... .... .... ...                                               26.6       2.2      24.4      20.9       1.6      19.3       5.7       0.6       5.1
36         All other offenses (except traffic) . . ..                                              306.1     263.4   2,800.8   2,337.1     194.2   2,142.9     727.0      69.2     657.9
37      Suspicion . . . .. . . .. .. .. .. .. .. . ..                                                1.6         –       1.4       1.2         –       1.0         –         –         –
38            Curfew and loitering law violations  ..                                               91.0      91.0       (X)      63.1      63.1       (X)      28.0      28.0       (X)
39        Runaways  . . . . . . . .. .. .. .. .. ....                                               75.8      75.8       (X)      34.0      34.0       (X)      41.8      41.8       (X)
40                                                     – Represents zero. X Not applicable. 1 Buying,...

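Note that two tables are returned, and the first one contains unwanted text at both the top and the bottom. That text should not fall inside the area I selected using plot().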
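As a side note, here is a minimal sketch for sanity-checking an area before handing it to Camelot. It assumes the convention described in the Camelot documentation, where an area string is x1,y1,x2,y2 with (x1, y1) the top-left corner and (x2, y2) the bottom-right corner in PDF coordinates (origin at the bottom-left of the page). area_string is a hypothetical helper, not part of Camelot.

```python
def area_string(x1, y1, x2, y2):
    """Build an 'x1,y1,x2,y2' string after checking the corner order.

    Assumed convention: (x1, y1) is the top-left corner and (x2, y2) the
    bottom-right corner, with y growing upwards from the bottom of the page.
    """
    if not x1 < x2:
        raise ValueError(f"x1 ({x1}) must be left of x2 ({x2})")
    if not y1 > y2:
        raise ValueError(f"y1 ({y1}) must be above y2 ({y2})")
    return f"{x1},{y1},{x2},{y2}"


# The two regions used in this question.
top_area = area_string(35, 591, 385, 343)
bottom_area = area_string(33, 297, 386, 65)
print(top_area)
print(bottom_area)
```

Both regions in this question pass the check, so the corner order does not appear to be the cause of the problem.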

Second: using table_regions on the same '35,591,385,343' PDF region (top table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
                                                    0                                                  1         2         3         4         5         6         7         8         9
0   Program. Represents arrests reported (not char...                                                                                                                                   
1   by the FBI. Some persons may be arrested more ...                                                                                                                                   
2   could represent multiple arrests of the same p...                                                                                                                                   
3                                                                                                            Total                          Male                        Female          
4                                     Offense charged                                                     Under 18  18 years            Under 18  18 years            Under 18  18 years
5                                                                                                  Total     years  and over     Total     years  and over     Total     years  and over
6   Total   . . .  .  .  .  .  . .  . .  . .  . . ...                                          11,062 .6  1,540 .0  9,522 .6  8,263 .3  1,071 .6  7,191 .7  2,799 .2    468 .3  2,330 .9
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...                                             467 .9     69 .1    398 .8    380 .2     56 .5    323 .7     87 .7     12 .6     75 .2
8                             Murder and nonnegligent                                                                                                                                   
9           manslaughter . . . . . . . .. .. .. .. ..                                               10.0       0.9       9.1       9.0       0.9       8.1       1.1         –       1.0
10       Forcible rape . . . . . . . .. .. .. .. .. .                                               17.5       2.6      14.9      17.2       2.5      14.7         –         –         –
11         Robbery . . . .. .. . .. . ... . ... . ...                                              102.1      25.5      76.6      90.0      22.9      67.1      12.1       2.5       9.5
....
34       Disorderly conduct . .. . . . . . .. .. .. .                                              529.5     136.1     393.3     387.1      90.8     296.2     142.4      45.3      97.1
35          Vagrancy . . . .. . . . ... .... .... ...                                               26.6       2.2      24.4      20.9       1.6      19.3       5.7       0.6       5.1
36         All other offenses (except traffic) . . ..                                              306.1     263.4   2,800.8   2,337.1     194.2   2,142.9     727.0      69.2     657.9
37      Suspicion . . . .. . . .. .. .. .. .. .. . ..                                                1.6         –       1.4       1.2         –       1.0         –         –         –
38            Curfew and loitering law violations  ..                                               91.0      91.0       (X)      63.1      63.1       (X)      28.0      28.0       (X)
39        Runaways  . . . . . . . .. .. .. .. .. ....                                               75.8      75.8       (X)      34.0      34.0       (X)      41.8      41.8       (X)
40                                                     – Represents zero. X Not applicable. 1 Buying,... 

Only one table this time, but the same problem: unwanted text that is clearly outside the selected region.

Third: using table_area on the '33,297,386,65' PDF region (bottom table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
                                                    0                                                  1         2         3         4         5         6         7         8         9
0   Program. Represents arrests reported (not char...                                                                                                                                   
1   by the FBI. Some persons may be arrested more ...                                                                                                                                   
2   could represent multiple arrests of the same p...                                                                                                                                   
3                                                                                                            Total                          Male                        Female          
4                                     Offense charged                                                     Under 18  18 years            Under 18  18 years            Under 18  18 years
5                                                                                                  Total     years  and over     Total     years  and over     Total     years  and over
6   Total   . . .  .  .  .  .  . .  . .  . .  . . ...                                          11,062 .6  1,540 .0  9,522 .6  8,263 .3  1,071 .6  7,191 .7  2,799 .2    468 .3  2,330 .9
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...                                             467 .9     69 .1    398 .8    380 .2     56 .5    323 .7     87 .7     12 .6     75 .2
8                             Murder and nonnegligent                                                                                                                                   
9           manslaughter . . . . . . . .. .. .. .. ..                                               10.0       0.9       9.1       9.0       0.9       8.1       1.1         –       1.0
10       Forcible rape . . . . . . . .. .. .. .. .. .                                               17.5       2.6      14.9      17.2       2.5      14.7         –         –         –
11         Robbery . . . .. .. . .. . ... . ... . ...                                              102.1      25.5      76.6      90.0      22.9      67.1      12.1       2.5       9.5
....
34       Disorderly conduct . .. . . . . . .. .. .. .                                              529.5     136.1     393.3     387.1      90.8     296.2     142.4      45.3      97.1
35          Vagrancy . . . .. . . . ... .... .... ...                                               26.6       2.2      24.4      20.9       1.6      19.3       5.7       0.6       5.1
36         All other offenses (except traffic) . . ..                                              306.1     263.4   2,800.8   2,337.1     194.2   2,142.9     727.0      69.2     657.9
37      Suspicion . . . .. . . .. .. .. .. .. .. . ..                                                1.6         –       1.4       1.2         –       1.0         –         –         –
38            Curfew and loitering law violations  ..                                               91.0      91.0       (X)      63.1      63.1       (X)      28.0      28.0       (X)
39        Runaways  . . . . . . . .. .. .. .. .. ....                                               75.8      75.8       (X)      34.0      34.0       (X)      41.8      41.8       (X)
40                                                     – Represents zero. X Not applicable. 1 Buying,...

It picked up two tables, and the first one is evidently still the top table. The unwanted text shows up again, but by now that is expected.

Fourth: using table_regions on the '33,297,386,65' PDF region (bottom table)

>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
                                                    0           1          2          3               4              5
0                    Table 325. Arrests by Race: 2009                                                                 
1   [Based on Uniform Crime Reporting (UCR) Progra...                                                                 
2   with a total population of 239,839,971 as esti...                                                                 
3                                                                                              American               
4                                     Offense charged                                    Indian/Alaskan  Asian Pacific
5                                                           Total      White      Black          Native       Islander
6   Total  . . . . .  . .  .  . .  .  . . .  .  . ...  10,690,561  7,389,208  3,027,153         150,544        123,656
7   Violent crime   .  .  .  .  .  .  .  . .  . . ...     456,965    268,346    177,766           5,608          5,245
8     Murder and nonnegligent manslaughter . .. ... .       9,739      4,741      4,801             100             97
9   Forcible rape . . . . . . . .. .. .. .. .... ....      16,362     10,644      5,319             169            230
10  Robbery . . . . .. . . . ... . ... . .... .......     100,496     43,039     55,742             726            989
11  Aggravated assault  . . . . . . . .. .. .........     330,368    209,922    111,904           4,613          3,929
....
34  All other offenses (except traffic) . .. .. .....   2,929,217  1,937,221    911,670          43,880         36,446
35  Suspicion . . .. . . . .. .. .. .. .. .. .. .....       1,513        677        828               1              7
36  Curfew and loitering law violations  . .. ... ...      89,578     54,439     33,207             872          1,060
37  Runaways  . . . . . . . .. .. .. .. .. .. .......      73,616     48,343     19,670           1,653          3,950
38           1 Except forcible rape and prostitution.

Better, but it still picks up unwanted text, as above.

I would greatly appreciate any suggestions or pointers. Thanks in advance!

The table_areas (not table_area) keyword argument works well and is the one to use (I am using Camelot 0.7.3).

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_areas=['35,591,385,343'], pages = '1')

returns:

This looks right.
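The transcripts in the question suggest that the misspelled table_area was accepted without any error and simply had no effect, since both regions produced the same two full-page tables. To catch this kind of typo early, a small guard can check keyword names before the call. The sketch below is only an illustration: check_kwargs and KNOWN_KWARGS are hypothetical names, and the keyword list is my own assumption of what read_pdf accepts for flavor='stream', so verify it against the documentation of your installed version.

```python
import difflib

# Assumption: keyword names accepted by camelot.read_pdf with flavor='stream'.
# This list is maintained by hand; check it against your Camelot version.
KNOWN_KWARGS = {
    "pages", "password", "flavor", "suppress_stdout", "layout_kwargs",
    "table_areas", "table_regions", "columns", "split_text", "flag_size",
    "strip_text", "edge_tol", "row_tol", "column_tol",
}


def check_kwargs(kwargs, known=KNOWN_KWARGS):
    """Raise TypeError on any keyword not in `known`, else return kwargs."""
    for name in kwargs:
        if name not in known:
            # Suggest the closest known spelling, if there is one.
            close = difflib.get_close_matches(name, list(known), n=1)
            hint = f"; did you mean {close[0]!r}?" if close else ""
            raise TypeError(f"unknown keyword {name!r}{hint}")
    return kwargs


call_kwargs = dict(
    flavor="stream",
    strip_text="\n",
    table_areas=["35,591,385,343"],
    pages="1",
)
check_kwargs(call_kwargs)  # passes silently

# With Camelot installed and the PDF available, the call would then be:
# tables = camelot.read_pdf("12s0324.pdf", **check_kwargs(call_kwargs))
```

Passing table_area=[...] through the same guard raises a TypeError that points at table_areas, instead of silently parsing the whole page.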