How do I view the min/max index in Parquet metadata?
I'm trying to use Parquet's min/max index. I'm following the question/answer here:
scala> val foo = spark.sql("select id, cast(id as string) text from range(1000)").sort("id")
scala> foo.printSchema
root
|-- id: long (nullable = false)
|-- text: string (nullable = false)
When I look at an individual Parquet file, I don't see any min/max:
> parquet-tools meta part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet
file: file:.../part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet
creator: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"text","type":"string","nullable":false,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
id: REQUIRED INT64 R:0 D:0
text: REQUIRED BINARY O:UTF8 R:0 D:0
row group 1: RC:125 TS:1840 OFFSET:4
--------------------------------------------------------------------------------
id: INT64 GZIP DO:0 FPO:4 SZ:259/1044/4.03 VC:125 ENC:PLAIN,BIT_PACKED
text: BINARY GZIP DO:0 FPO:263 SZ:263/796/3.03 VC:125 ENC:PLAIN,BIT_PACKED
I tried .sortWithinPartitions("id") and got the same result.
You can view the statistics with parquet-tools. In your case, you would run:
parquet-tools dump -d -n part-00000-tid-5174196010762120422-95fb2e22-0dfb-4597-bdca-4fb573873959-0-c000.gz.parquet
As of today (June 9, 2017), Spark 2.1.1 with Parquet 1.8.1 does not generate statistics for binary columns such as strings.
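For context on why missing statistics matter, here is a minimal Python sketch of how a reader could use per-row-group min/max values to skip data. It is not Spark's or parquet-mr's actual code; the `ColumnStats` class, the helper names, and the layout of eight row groups of 125 sorted ids are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnStats:
    """Min/max for one column chunk; None means the writer recorded no statistics."""
    min_value: Optional[int]
    max_value: Optional[int]


def may_contain(stats: ColumnStats, value: int) -> bool:
    # Without statistics nothing can be ruled out, so the chunk must be scanned.
    if stats.min_value is None or stats.max_value is None:
        return True
    return stats.min_value <= value <= stats.max_value


def row_groups_to_scan(stats_per_group: List[ColumnStats], value: int) -> List[int]:
    # Indices of the row groups that an equality filter still has to read.
    return [i for i, s in enumerate(stats_per_group) if may_contain(s, value)]


# Hypothetical layout: 1000 sorted ids split into 8 row groups of 125 rows each.
sorted_groups = [ColumnStats(i * 125, i * 125 + 124) for i in range(8)]
# Same layout, but written without statistics (as with the string column here).
no_stats_groups = [ColumnStats(None, None) for _ in range(8)]

print(row_groups_to_scan(sorted_groups, 500))
print(row_groups_to_scan(no_stats_groups, 500))
```

With statistics on sorted data, only one row group can match `id = 500`; without them, every row group has to be read.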
You want parquet-reader from https://github.com/apache/arrow/tree/master/cpp
Good luck compiling it.
Then you will get metadata like this, including your min/max:
parquet-reader --only-metadata --json --columns=0,1,2 widey_event_visit_start_datetime_sorted.pq
{
"FileName": "widey_event_visit_start_datetime_sorted.pq",
"Version": "0",
"CreatedBy": "parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)",
"TotalRows": "732999",
"NumberOfRowGroups": "1",
"NumberOfRealColumns": "88",
"NumberOfColumns": "88",
"Columns": [
{ "Id": "0", "Name": "destination", "PhysicalType": "BYTE_ARRAY", "LogicalType": "UTF8" },
{ "Id": "1", "Name": "visit_id", "PhysicalType": "INT32", "LogicalType": "NONE" },
{ "Id": "2", "Name": "visit_start_datetime", "PhysicalType": "INT64", "LogicalType": "NONE" }
],
"RowGroups": [
{
"Id": "0", "TotalBytes": "125009099", "Rows": "732999",
"ColumnChunks": [
{"Id": "0", "Values": "732999", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "WS_Programmes_TEST", "Min": "GNL_News_TEST" },
"Compression": "SNAPPY", "Encodings": "PLAIN_DICTIONARY RLE BIT_PACKED ", "UncompressedSize": "166512", "CompressedSize": "134481" },
{"Id": "1", "Values": "732999", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "60419931", "Min": "1072" },
"Compression": "SNAPPY", "Encodings": "PLAIN_DICTIONARY RLE BIT_PACKED ", "UncompressedSize": "860549", "CompressedSize": "786120" },
{"Id": "2", "Values": "732999", "StatsSet": "True", "Stats": {"NumNulls": "0", "DistinctValues": "0", "Max": "1548892673", "Min": "1548806803" },
"Compression": "SNAPPY", "Encodings": "PLAIN_DICTIONARY RLE BIT_PACKED ", "UncompressedSize": "5413", "CompressedSize": "3965" }
]
}
]
}
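If you only need the min/max values, the JSON can be post-processed with a short script. The following is a rough sketch using only Python's standard library. The `SAMPLE` string is a trimmed copy of the output above, the function name `min_max_by_column` is made up for this example, and it assumes a column chunk's "Id" matches the "Id" in "Columns", which holds for this output but is not something I have verified in general.

```python
import json

# Trimmed copy of the parquet-reader output above: two column chunks, fewer fields.
SAMPLE = """
{
  "Columns": [
    {"Id": "0", "Name": "destination"},
    {"Id": "1", "Name": "visit_id"}
  ],
  "RowGroups": [
    {"Id": "0", "ColumnChunks": [
      {"Id": "0", "StatsSet": "True",
       "Stats": {"Max": "WS_Programmes_TEST", "Min": "GNL_News_TEST"}},
      {"Id": "1", "StatsSet": "True",
       "Stats": {"Max": "60419931", "Min": "1072"}}
    ]}
  ]
}
"""


def min_max_by_column(metadata: dict) -> list:
    """Return (row group id, column name, min, max) for every chunk with stats."""
    # Assumption: a chunk's "Id" refers to the entry with the same "Id" in "Columns".
    names = {c["Id"]: c["Name"] for c in metadata["Columns"]}
    rows = []
    for group in metadata["RowGroups"]:
        for chunk in group["ColumnChunks"]:
            # Chunks written without statistics are skipped.
            if chunk.get("StatsSet") != "True":
                continue
            stats = chunk["Stats"]
            rows.append((group["Id"], names[chunk["Id"]], stats["Min"], stats["Max"]))
    return rows


for row in min_max_by_column(json.loads(SAMPLE)):
    print(row)
```

All values stay strings because that is how the tool prints them; convert them yourself if you need numeric comparisons.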