How to extract all tables from a PDF?
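Is there a way to use Python to extract the data from every table in a PDF?
I have tested tabula, camelot, and pdfplumber, but none of them extracts everything correctly.
An example:
I would like to work with matrices, data frames...
Should I go with OCR for better recognition?
Edit:
I am trying to use tabula-py to retrieve this table from the PDF.
My script: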
import tabula
# filename holds the path to the PDF
tables = tabula.read_pdf(filename, pages="3", output_format="dataframe", multiple_tables=True)
print(tables)
Output:
[ amortization (EBITDA) 205 306 263 284 255
0 Operating profit (EBIT) 125 243 207 221 191
1 Net financials (3) (7) (8) (5) (13)
2 Profit for the year before tax 122 247 201 216 178
3 Profit for the year of continuing operations 92 192 154 160 138
4 Profit/loss for the year of discontinued opera... - 3 (14) 5 (134)
5 Profit for the year 92 195 140 165 4
6 NaN NaN NaN NaN NaN NaN
7 STATEMENT OF FINANCIAL POSITION NaN NaN NaN NaN NaN
8 Total assets 1,393 1,444 1,852 1,854 2,022
9 Average invested capital including goodwill 772 736 659 708 914
10 Net working capital 318 314 268 314 279
11 Total equity 723 740 884 833 809
12 Non-controlling interest 10 7 5 4 4
13 Net interest-bearing debt, end of year 17 25 82 52 118
14 NaN NaN NaN NaN NaN NaN
15 STATEMENT OF CASH FLOWS NaN NaN NaN NaN NaN
16 Cash flow from operating activities 175 183 226 264 232
17 Cash flow from investing activities (88) 55 15 (91) (167)
18 Investments in property, plant and equipment (72) (81) (45) (77) (58)
19 Free cash flow 87 238 241 173 65
20 Cash flow from financing activities (79) (319) (172) (109) (35)
21 Net cash flow for the year 8 (81) 69 64 30
22 NaN NaN NaN NaN NaN NaN
23 KEY RATIOS (%) NaN NaN NaN NaN NaN
24 Revenue growth 3.2 1.0 2.9 5.7 5.5
25 Gross margin 55.3 56.8 54.8 57.3 56.6
26 Cost ratio 50.7 47.7 47.0 49.3 48.7
27 EBITDA margin 7.5 11.5 10.0 11.0 10.5
28 EBIT margin 4.5 9.1 7.8 8.6 7.9
29 Tax rate 24.0 22.2 23.2 25.8 22.5
30 Return on equity 12.2 23.5 18.0 19.5 16.9
31 Equity ratio 51.9 51.2 47.5 45.3 40.0
32 Return on invested capital, 12 months trailing... 16.2 33.0 31.4 31.2 20.9
33 Net working capital in proportion to NaN NaN NaN NaN NaN
34 12 months trailing revenue 11.6 11.8 10.2 12.3 11.5
35 Cash conversion 0.7 1.0 1.2 0.8 0.3
36 Financial gearing 2.4 3.4 9.3 6.3 14.6
37 INCOME STATEMENT NaN NaN NaN NaN NaN
38 Revenue 2,749 2,665 2,638 2,563 2,424
39 Gross profit 1,519 1,513 1,446 1,470 1,371
40 NaN NaN NaN NaN NaN NaN
41 SHARE-BASED RATIOS NaN NaN NaN NaN NaN
42 Average number of shares excluding NaN NaN NaN NaN NaN
43 treasury shares, diluted (thousands) 16,639 16,678 16,550 16,447 16,402
44 Share price, end of year, DKK 140.0 172.0 187.5 185.5 122.0
45 Earnings per share, DKK 5.3 11.6 8.5 9.9 0.1
46 Diluted earnings per share, DKK 5.3 11.6 8.5 9.9 0.1
47 Diluted cash flow per share, DKK 10.5 11.0 13.7 18.2 14.2
48 Diluted net asset value per share, DKK 42.9 44.0 53.1 50.3 49.1
49 Diluted price/earnings, DKK 26.4 14.8 22.1 18.7 1,220.0
50 NaN NaN NaN NaN NaN NaN
51 EMPLOYEES NaN NaN NaN NaN NaN
52 Number of employees, calculated as FTEs, end o... 1,186 1,146 1,042 1,047 1,264
53 NUMBER OF STORES (OWN STORES) NaN NaN NaN NaN NaN
54 Retail stores 126 115 95 107 102
55 Concessions 43 42 42 41 42]
It ignores the first row. What am I doing wrong?
Here is the link to download the PDF; test on page 3.
In my opinion, Camelot gives good results with the stream flavor.
import camelot
# Replace the placeholder string with the path to your PDF
tables = camelot.read_pdf('YOUR-PDF-PATH', pages='3', flavor='stream')
print(tables[0].df)
This gives:
0 DKK million 2016/17 2015/16 2014/15 2013/14 2012/131)
1 INCOME STATEMENT
2 Revenue 2,749 2,665 2,638 2,563 2,424
3 Gross profit 1,519 1,513 1,446 1,470 1,371
4 Operating profit before depreciation and
5 amortization (EBITDA) 205 306 263 284 255
6 Operating profit (EBIT) 125 243 207 221 191
7 Net financials (3) (7) (8) (5) (13)
8 Profit for the year before tax 122 247 201 216 178
9 Profit for the year of continuing operations 92 192 154 160 138
10 Profit/loss for the year of discontinued opera... - 3 (14) 5 (134)
11 Profit for the year 92 195 140 165 4
12 STATEMENT OF FINANCIAL POSITION
13 Total assets 1,393 1,444 1,852 1,854 2,022
14 Average invested capital including goodwill 772 736 659 708 914
15 Net working capital 318 314 268 314 279
16 Total equity 723 740 884 833 809
17 Non-controlling interest 10 7 5 4 4
18 Net interest-bearing debt, end of year 17 25 82 52 118
19 STATEMENT OF CASH FLOWS
20 Cash flow from operating activities 175 183 226 264 232
21 Cash flow from investing activities (88) 55 15 (91) (167)
22 Investments in property, plant and equipment (72) (81) (45) (77) (58)
23 Free cash flow 87 238 241 173 65
24 Cash flow from financing activities (79) (319) (172) (109) (35)
25 Net cash flow for the year 8 (81) 69 64 30
26 KEY RATIOS (%)
27 Revenue growth 3.2 1.0 2.9 5.7 5.5
28 Gross margin 55.3 56.8 54.8 57.3 56.6
29 Cost ratio 50.7 47.7 47.0 49.3 48.7
30 EBITDA margin 7.5 11.5 10.0 11.0 10.5
31 EBIT margin 4.5 9.1 7.8 8.6 7.9
32 Tax rate 24.0 22.2 23.2 25.8 22.5
33 Return on equity 12.2 23.5 18.0 19.5 16.9
34 Equity ratio 51.9 51.2 47.5 45.3 40.0
35 Return on invested capital, 12 months trailing... 16.2 33.0 31.4 31.2 20.9
36 Net working capital in proportion to
37 12 months trailing revenue 11.6 11.8 10.2 12.3 11.5
38 Cash conversion 0.7 1.0 1.2 0.8 0.3
39 Financial gearing 2.4 3.4 9.3 6.3 14.6
40 SHARE-BASED RATIOS
41 Average number of shares excluding
42 treasury shares, diluted (thousands) 16,639 16,678 16,550 16,447 16,402
43 Share price, end of year, DKK 140.0 172.0 187.5 185.5 122.0
44 Earnings per share, DKK 5.3 11.6 8.5 9.9 0.1
45 Diluted earnings per share, DKK 5.3 11.6 8.5 9.9 0.1
46 Diluted cash flow per share, DKK 10.5 11.0 13.7 18.2 14.2
47 Diluted net asset value per share, DKK 42.9 44.0 53.1 50.3 49.1
48 Diluted price/earnings, DKK 26.4 14.8 22.1 18.7 1,220.0
49 EMPLOYEES
50 Number of employees, calculated as FTEs, end o... 1,186 1,146 1,042 1,047 1,264
51 NUMBER OF STORES (OWN STORES)
52 Retail stores 126 115 95 107 102
53 Concessions 43 42 42 41 42
For more information about Camelot, you can read the official documentation. In particular, the API reference should be useful.
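Since the goal is to work with matrices or data frames, the extracted table will probably need some cleanup: in the printed output above every cell is text, section headings sit on rows of their own, and a few labels are wrapped over two lines. Below is a minimal sketch of one way to tidy such a table with pandas alone. The helper names parse_amount and tidy_table are hypothetical, the sample rows are copied from the output above, and the code assumes that blank cells are empty strings and that section headings are written in upper case.

```python
import pandas as pd


def parse_amount(text):
    """Convert an accounting-style string such as '1,393' or '(88)' to a float."""
    text = str(text).strip()
    if text in ("", "-"):
        return float("nan")
    # Parentheses denote a negative figure in this report.
    negative = text.startswith("(") and text.endswith(")")
    number = float(text.strip("()").replace(",", ""))
    return -number if negative else number


def tidy_table(raw):
    """Turn a string-only table into a DataFrame with numeric columns."""
    # Assumption: the first row holds the column titles.
    header = raw.iloc[0].tolist()
    header[0] = "item"
    body = raw.iloc[1:].reset_index(drop=True)
    body.columns = header

    rows, pending = [], ""
    for _, row in body.iterrows():
        values = row.iloc[1:]
        label = (pending + " " + row["item"]).strip()
        if (values == "").all():
            # No figures on this line: either a section heading (upper case,
            # dropped here) or the first half of a wrapped label (kept).
            pending = "" if row["item"].isupper() else label
            continue
        rows.append([label] + [parse_amount(v) for v in values])
        pending = ""
    return pd.DataFrame(rows, columns=header).set_index("item")


# A few rows copied from the output above, reduced to two year columns.
raw = pd.DataFrame([
    ["DKK million", "2016/17", "2015/16"],
    ["INCOME STATEMENT", "", ""],
    ["Revenue", "2,749", "2,665"],
    ["Operating profit before depreciation and", "", ""],
    ["amortization (EBITDA)", "205", "306"],
    ["Net financials", "(3)", "(7)"],
])
clean = tidy_table(raw)
print(clean)
```

With the real file, tables[0].df from the Camelot call would take the place of raw, provided it has the same shape as the printed output. The section headings are dropped in this sketch; if you need them, they could be kept as an extra column instead.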