How to specify column coordinates on the tabula command line
I want to extract table data from a PDF, and I am using the following command to do it:
java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -t example.pdf
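With this command, however, the data from two columns gets mixed together in some rows, so I would like to specify the column coordinates to get clean output.
I don't know how to obtain or pass the column coordinates, so it would help if someone could show me the right command.
Thanks in advance!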
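You can specify column coordinates with the -c (or --columns) option. The values you pass are the x coordinates of the boundaries between columns. So if one column spans 10.5 to 13.5 and the next spans 13.5 to 17.5, you list only 13.5. Guessing also needs to stay off; per the help text below, -g is a flag that takes no value, so simply leave it out. You didn't provide a sample PDF, so I can't give you the exact coordinates, but your command would look something like this:
java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -c 15.7,17.3,19.2,33.2,70.1,100.7,200.6,300.7 -t example.pdf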
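If you have measured where each column starts and ends (for example, by reading x positions off the PDF in a viewer), the boundary list for -c can be derived mechanically. The sketch below is only an illustration: `column_boundaries` and `build_command` are hypothetical helper names, and using the midpoint when two neighbouring columns leave a gap is an assumption of this sketch, not something tabula requires. It only assembles the argument list from the -a, -c and -t options shown in the help text; it does not run tabula.

```python
from typing import List, Sequence, Tuple


def column_boundaries(spans: Sequence[Tuple[float, float]]) -> List[float]:
    """Return the x coordinates that separate adjacent columns.

    `spans` holds the (left, right) x extent of each column, in any order.
    If two neighbours leave a gap, the midpoint of the gap is used
    (an assumption of this sketch).
    """
    ordered = sorted(spans)
    boundaries = []
    # Pair each column with its right-hand neighbour.
    for (_, right), (next_left, _) in zip(ordered, ordered[1:]):
        boundaries.append(round((right + next_left) / 2, 2))
    return boundaries


def build_command(jar: str, pdf: str,
                  area: Sequence[float],
                  spans: Sequence[Tuple[float, float]]) -> List[str]:
    """Assemble a tabula-java argument list; nothing is executed here."""
    cmd = ["java", "-jar", jar, "-a", ",".join(str(v) for v in area)]
    cols = column_boundaries(spans)
    if cols:  # a single column has no interior boundary to pass
        cmd += ["-c", ",".join(str(x) for x in cols)]
    cmd += ["-t", pdf]  # stream mode, as in the question
    return cmd


if __name__ == "__main__":
    # The two-column example from the answer: only 13.5 is listed.
    print(column_boundaries([(10.5, 13.5), (13.5, 17.5)]))  # [13.5]
```

The list returned by `build_command` can be passed to `subprocess.run` once you have checked the coordinates against your own PDF.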
The help output describes the other options you can use to fine-tune the command:
$ java -jar target/tabula-1.0.1-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f
<FORMAT>] [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
[-s <PASSWORD>] [-t] [-u] [-v]
Tabula helps you extract tables from PDFs
-a,--area <AREA> Portion of the page to analyze
(top,left,bottom,right). Example: --area
269.875,12.75,790.5,561. Default is entire
page
-b,--batch <DIRECTORY> Convert all .pdfs in the provided directory.
-c,--columns <COLUMNS> X coordinates of column boundaries. Example
--columns 10.1,20.2,30.3
-d,--debug Print detected table areas instead of
processing.
-f,--format <FORMAT> Output format: (CSV,TSV,JSON). Default: CSV
-g,--guess Guess the portion of the page to analyze per
page.
-h,--help Print this help text.
-i,--silent Suppress all stderr output.
-l,--lattice Force PDF to be extracted using lattice-mode
extraction (if there are ruling lines
separating each cell, as in a PDF of an Excel
spreadsheet)
-n,--no-spreadsheet [Deprecated in favor of -t/--stream] Force PDF
not to be extracted using spreadsheet-style
extraction (if there are no ruling lines
separating each cell)
-o,--outfile <OUTFILE> Write output to <file> instead of STDOUT.
Default: -
-p,--pages <PAGES> Comma separated list of ranges, or all.
Examples: --pages 1-3,5-7, --pages 3 or
--pages all. Default is --pages 1
-r,--spreadsheet [Deprecated in favor of -l/--lattice] Force
PDF to be extracted using spreadsheet-style
extraction (if there are ruling lines
separating each cell, as in a PDF of an Excel
spreadsheet)
-s,--password <PASSWORD> Password to decrypt document. Default is empty
-t,--stream Force PDF to be extracted using stream-mode
extraction (if there are no ruling lines
separating each cell)
-u,--use-line-returns Use embedded line returns in cells. (Only in
spreadsheet mode.)
-v,--version Print version and exit.