How to read RC file contents
I have loaded a file into my Hive table, which is stored in RCFile format.
When I try to read the file using
hadoop fs -text /apps/hive/warehouse/emp_rcfileformat/000000_0
or
hive --orcfiledump /apps/hive/warehouse/emp_rcfileformat/000000_0
I get no results. I am using Hive 0.14.
When I use orcfiledump, it fails with this error:
Exception in thread "main" org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file /apps/hive/warehouse/emp_rcfileformat/000000_0. Invalid postscript.
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.ensureOrcFooter(ReaderImpl.java:248)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:373)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:314)
at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:237)
at org.apache.hadoop.hive.ql.io.orc.FileDump.printMetaData(FileDump.java:101)
at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:86)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Hive provides an rcfilecat tool to display the contents of RCFiles:
$ bin/hive --service rcfilecat /user/hive/warehouse/columntable/000000_0
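If you are not sure which storage format the table actually uses, the table definition will tell you. A minimal check, using the table name from the question:
hive -e "DESCRIBE FORMATTED emp_rcfileformat;" | grep -iE 'format|serde'
The InputFormat/OutputFormat and SerDe lines show whether the table is RCFile, ORC, or something else, so you can pick the matching dump tool.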
ORC file dump utility:
The ORC file dump utility analyzes ORC files. To invoke it, use this command:
// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
// Hive version 0.15 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
Adding -d to the command will cause it to dump the data in the ORC file rather than the metadata (Hive 1.1.0 and later).
Adding --rowindex with a comma-separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top-level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).
Adding -t to the command will print the timezone id of the writer.
Adding -j to the command will print the ORC file metadata in JSON format. To pretty-print the JSON metadata, add -p to the command.
<location-of-orc-file> is the URI of the ORC file. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
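For example, a concrete Hive 1.3.0+ invocation might look like the following (the path here is only a placeholder, not one of the files from the question):
hive --orcfiledump -j -p /apps/hive/warehouse/my_orc_table/000000_0
hive --orcfiledump --rowindex 1,2 /apps/hive/warehouse/my_orc_table/000000_0
The first command prints the file metadata as pretty-printed JSON; the second prints the row indexes for columns 1 and 2.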
An ORC file is not the same as an RCFile. Make sure the reader you use matches the file format.
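A quick way to tell what a warehouse file really is (assuming the path from the question) is to look at its first few bytes:
hadoop fs -cat /apps/hive/warehouse/emp_rcfileformat/000000_0 | head -c 3
RCFiles start with the magic bytes RCF (or SEQ for very old files), while ORC files start with ORC. Choose rcfilecat or orcfiledump accordingly.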