Camelot: table_area and table_regions do not work as expected
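For the past few days I have been trying to get Camelot to work on specific areas of a PDF page, but it keeps confusing me. I read and tried the suggestions in the docs and in some bug reports, to no avail. I could use some help.
I took an example from the docs because it has multiple tables, this one. I modified the original command to extract only one of the two tables, from: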
tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text=' .\n')
to:
tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')
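Given that:
- I changed the strip_text value, because the original one was removing the spaces between words,
- I used table_area instead of the documented table_areas, because the former triggered a verbose explanation while the latter raised an error (the bug is explained here, and the docs still seem to be wrong),
- I tried extracting both tables and checked the respective areas with Camelot's plotting feature, as explained in the docs here, so they should be correct,
- I also tried table_regions, which at least pulls out one table instead of two, but is still quite inaccurate (see the comments below).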
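For reference, the Camelot docs describe each area as a string of the form x1,y1,x2,y2, where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space (origin at the bottom-left of the page). A minimal sketch of building and sanity-checking those strings follows; format_area and plot_text are hypothetical helper names, not part of Camelot, and plot_text assumes matplotlib is installed and that at least one table is detected on the page.

```python
def format_area(x1, y1, x2, y2):
    """Return an 'x1,y1,x2,y2' string for table_areas / table_regions.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in
    PDF coordinate space, whose origin is the bottom-left of the page.
    Hypothetical helper, not part of Camelot.
    """
    if not (x1 < x2 and y1 > y2):
        raise ValueError(
            "expected top-left (x1, y1) and bottom-right (x2, y2): "
            "x1 < x2 and y1 > y2"
        )
    return f"{x1},{y1},{x2},{y2}"


# The two areas quoted in this question.
TOP_TABLE = format_area(35, 591, 385, 343)
BOTTOM_TABLE = format_area(33, 297, 386, 65)


def plot_text(pdf_path, page="1"):
    """Show the text layout of one page so coordinates can be read off."""
    import camelot  # imported lazily so the helpers above work without it

    tables = camelot.read_pdf(pdf_path, flavor="stream", pages=page)
    # kind='text' draws the text boxes found on the page (needs matplotlib).
    camelot.plot(tables[0], kind="text").show()
```

Swapping the two y values (for example passing 343 before 591) raises a ValueError here, which catches the most common mistake when reading coordinates off a plot.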
Below are the results of my attempts on the PDF above:
First: using table_area on the '35,591,385,343' PDF area (top table)
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
0 1 2 3 4 5 6 7 8 9
0 Program. Represents arrests reported (not char...
1 by the FBI. Some persons may be arrested more ...
2 could represent multiple arrests of the same p...
3 Total Male Female
4 Offense charged Under 18 18 years Under 18 18 years Under 18 18 years
5 Total years and over Total years and over Total years and over
6 Total . . . . . . . . . . . . . . . ... 11,062 .6 1,540 .0 9,522 .6 8,263 .3 1,071 .6 7,191 .7 2,799 .2 468 .3 2,330 .9
7 Violent crime . . . . . . . . . . . ... 467 .9 69 .1 398 .8 380 .2 56 .5 323 .7 87 .7 12 .6 75 .2
8 Murder and nonnegligent
9 manslaughter . . . . . . . .. .. .. .. .. 10.0 0.9 9.1 9.0 0.9 8.1 1.1 – 1.0
10 Forcible rape . . . . . . . .. .. .. .. .. . 17.5 2.6 14.9 17.2 2.5 14.7 – – –
11 Robbery . . . .. .. . .. . ... . ... . ... 102.1 25.5 76.6 90.0 22.9 67.1 12.1 2.5 9.5
....
34 Disorderly conduct . .. . . . . . .. .. .. . 529.5 136.1 393.3 387.1 90.8 296.2 142.4 45.3 97.1
35 Vagrancy . . . .. . . . ... .... .... ... 26.6 2.2 24.4 20.9 1.6 19.3 5.7 0.6 5.1
36 All other offenses (except traffic) . . .. 306.1 263.4 2,800.8 2,337.1 194.2 2,142.9 727.0 69.2 657.9
37 Suspicion . . . .. . . .. .. .. .. .. .. . .. 1.6 – 1.4 1.2 – 1.0 – – –
38 Curfew and loitering law violations .. 91.0 91.0 (X) 63.1 63.1 (X) 28.0 28.0 (X)
39 Runaways . . . . . . . .. .. .. .. .. .... 75.8 75.8 (X) 34.0 34.0 (X) 41.8 41.8 (X)
40 – Represents zero. X Not applicable. 1 Buying,...
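Note that two tables are returned, and that the result includes unwanted text at both the top and the bottom, which should not fall inside the area selected with plot().
Second: using table_regions on the same '35,591,385,343' PDF area (top table)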
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['35,591,385,343'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
0 1 2 3 4 5 6 7 8 9
0 Program. Represents arrests reported (not char...
1 by the FBI. Some persons may be arrested more ...
2 could represent multiple arrests of the same p...
3 Total Male Female
4 Offense charged Under 18 18 years Under 18 18 years Under 18 18 years
5 Total years and over Total years and over Total years and over
6 Total . . . . . . . . . . . . . . . ... 11,062 .6 1,540 .0 9,522 .6 8,263 .3 1,071 .6 7,191 .7 2,799 .2 468 .3 2,330 .9
7 Violent crime . . . . . . . . . . . ... 467 .9 69 .1 398 .8 380 .2 56 .5 323 .7 87 .7 12 .6 75 .2
8 Murder and nonnegligent
9 manslaughter . . . . . . . .. .. .. .. .. 10.0 0.9 9.1 9.0 0.9 8.1 1.1 – 1.0
10 Forcible rape . . . . . . . .. .. .. .. .. . 17.5 2.6 14.9 17.2 2.5 14.7 – – –
11 Robbery . . . .. .. . .. . ... . ... . ... 102.1 25.5 76.6 90.0 22.9 67.1 12.1 2.5 9.5
....
34 Disorderly conduct . .. . . . . . .. .. .. . 529.5 136.1 393.3 387.1 90.8 296.2 142.4 45.3 97.1
35 Vagrancy . . . .. . . . ... .... .... ... 26.6 2.2 24.4 20.9 1.6 19.3 5.7 0.6 5.1
36 All other offenses (except traffic) . . .. 306.1 263.4 2,800.8 2,337.1 194.2 2,142.9 727.0 69.2 657.9
37 Suspicion . . . .. . . .. .. .. .. .. .. . .. 1.6 – 1.4 1.2 – 1.0 – – –
38 Curfew and loitering law violations .. 91.0 91.0 (X) 63.1 63.1 (X) 28.0 28.0 (X)
39 Runaways . . . . . . . .. .. .. .. .. .... 75.8 75.8 (X) 34.0 34.0 (X) 41.8 41.8 (X)
40 – Represents zero. X Not applicable. 1 Buying,...
Only one table this time, but the same problem: unwanted text that is evidently outside the selected area.
Third: using table_area on the '33,297,386,65' PDF area (bottom table)
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=2>
>>> tables[0].df
0 1 2 3 4 5 6 7 8 9
0 Program. Represents arrests reported (not char...
1 by the FBI. Some persons may be arrested more ...
2 could represent multiple arrests of the same p...
3 Total Male Female
4 Offense charged Under 18 18 years Under 18 18 years Under 18 18 years
5 Total years and over Total years and over Total years and over
6 Total . . . . . . . . . . . . . . . ... 11,062 .6 1,540 .0 9,522 .6 8,263 .3 1,071 .6 7,191 .7 2,799 .2 468 .3 2,330 .9
7 Violent crime . . . . . . . . . . . ... 467 .9 69 .1 398 .8 380 .2 56 .5 323 .7 87 .7 12 .6 75 .2
8 Murder and nonnegligent
9 manslaughter . . . . . . . .. .. .. .. .. 10.0 0.9 9.1 9.0 0.9 8.1 1.1 – 1.0
10 Forcible rape . . . . . . . .. .. .. .. .. . 17.5 2.6 14.9 17.2 2.5 14.7 – – –
11 Robbery . . . .. .. . .. . ... . ... . ... 102.1 25.5 76.6 90.0 22.9 67.1 12.1 2.5 9.5
....
34 Disorderly conduct . .. . . . . . .. .. .. . 529.5 136.1 393.3 387.1 90.8 296.2 142.4 45.3 97.1
35 Vagrancy . . . .. . . . ... .... .... ... 26.6 2.2 24.4 20.9 1.6 19.3 5.7 0.6 5.1
36 All other offenses (except traffic) . . .. 306.1 263.4 2,800.8 2,337.1 194.2 2,142.9 727.0 69.2 657.9
37 Suspicion . . . .. . . .. .. .. .. .. .. . .. 1.6 – 1.4 1.2 – 1.0 – – –
38 Curfew and loitering law violations .. 91.0 91.0 (X) 63.1 63.1 (X) 28.0 28.0 (X)
39 Runaways . . . . . . . .. .. .. .. .. .... 75.8 75.8 (X) 34.0 34.0 (X) 41.8 41.8 (X)
40 – Represents zero. X Not applicable. 1 Buying,...
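It picks up two tables, and the first one is evidently still the top table. The unwanted text has the same problem, though by now that is expected.
Fourth: using table_regions on the '33,297,386,65' PDF area (bottom table)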
>>> tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_regions=['33,297,386,65'], pages = '1')
>>> tables
<TableList n=1>
>>> tables[0].df
0 1 2 3 4 5
0 Table 325. Arrests by Race: 2009
1 [Based on Uniform Crime Reporting (UCR) Progra...
2 with a total population of 239,839,971 as esti...
3 American
4 Offense charged Indian/Alaskan Asian Pacific
5 Total White Black Native Islander
6 Total . . . . . . . . . . . . . . . . ... 10,690,561 7,389,208 3,027,153 150,544 123,656
7 Violent crime . . . . . . . . . . . ... 456,965 268,346 177,766 5,608 5,245
8 Murder and nonnegligent manslaughter . .. ... . 9,739 4,741 4,801 100 97
9 Forcible rape . . . . . . . .. .. .. .. .... .... 16,362 10,644 5,319 169 230
10 Robbery . . . . .. . . . ... . ... . .... ....... 100,496 43,039 55,742 726 989
11 Aggravated assault . . . . . . . .. .. ......... 330,368 209,922 111,904 4,613 3,929
....
34 All other offenses (except traffic) . .. .. ..... 2,929,217 1,937,221 911,670 43,880 36,446
35 Suspicion . . .. . . . .. .. .. .. .. .. .. ..... 1,513 677 828 1 7
36 Curfew and loitering law violations . .. ... ... 89,578 54,439 33,207 872 1,060
37 Runaways . . . . . . . .. .. .. .. .. .. ....... 73,616 48,343 19,670 1,653 3,950
38 1 Except forcible rape and prostitution.
Better, but it picks up unwanted text just as above.
I would greatly appreciate any advice or pointers. Thanks in advance!
The table_areas (not table_area) keyword argument works well and is the one to use (I am on Camelot 0.7.3).
tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_areas=['35,591,385,343'], pages = '1')
returns:
This seems right.
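The identical full-page output in the table_area attempts is consistent with the misspelled keyword simply being ignored, although that is an inference and not something confirmed in this thread. As a rough sketch, a small guard can reject misspelled keyword names before they reach read_pdf. check_kwargs and read_areas are hypothetical helpers, and KNOWN_KWARGS is an illustrative list of stream-flavor keywords assumed from the Camelot docs, not an exhaustive or authoritative one.

```python
import difflib

# Assumption: keyword names as documented for camelot.read_pdf with
# flavor='stream'. Illustrative only; extend it for your Camelot version.
KNOWN_KWARGS = sorted({
    "pages", "password", "flavor", "suppress_stdout", "layout_kwargs",
    "table_areas", "table_regions", "columns", "split_text", "flag_size",
    "strip_text", "edge_tol", "row_tol", "column_tol",
})


def check_kwargs(kwargs):
    """Raise TypeError on any keyword not in KNOWN_KWARGS, with a suggestion."""
    for name in kwargs:
        if name not in KNOWN_KWARGS:
            close = difflib.get_close_matches(name, KNOWN_KWARGS, n=1)
            hint = f"; did you mean {close[0]!r}?" if close else ""
            raise TypeError(f"unknown keyword {name!r}{hint}")
    return kwargs


def read_areas(pdf_path, areas, **kwargs):
    """Call camelot.read_pdf on the given areas after validating keywords."""
    import camelot  # imported lazily so check_kwargs works without Camelot

    options = {"flavor": "stream", "strip_text": "\n", "pages": "1",
               "table_areas": areas}
    options.update(kwargs)
    return camelot.read_pdf(pdf_path, **check_kwargs(options))
```

With this guard, read_areas('12s0324.pdf', ['35,591,385,343'], table_area=['33,297,386,65']) fails immediately with a hint pointing at table_areas instead of quietly parsing the whole page.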