How to convert a 500GB SQL table into Apache Parquet?

Perhaps this is well documented, but I am confused about how to do it (there are a lot of Apache tools).

When I create a SQL table, I use a command like the following:

CREATE TABLE table_name(
   column1 datatype,
   column2 datatype,
   column3 datatype,
   .....
   columnN datatype,
   PRIMARY KEY( one or more columns )
);

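How do I convert this existing table into Parquet? Is the resulting file written to disk? If the original data is several GB, how long does the conversion take?

Alternatively, could I format the original raw data directly into Parquet?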

Apache Spark can be used to do this:

1. Load your table from MySQL via JDBC.
2. Save it as a Parquet file.

Example:

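from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Read the whole table over JDBC; the MySQL JDBC driver must be on Spark's classpath
df = spark.read.jdbc("YOUR_MYSQL_JDBC_CONN_STRING", "YOUR_TABLE",
                     properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"})
# Write the DataFrame out as Parquet (Spark creates a directory of part files)
df.write.parquet("YOUR_HDFS_FILE")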
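
For a table of around 500GB, a plain `spark.read.jdbc(url, table)` call reads through a single JDBC connection, which may be slow. The following is a minimal sketch of one way to split the read, assuming the table has a numeric, indexed key column (called `id` here, which is an assumption) whose minimum and maximum values you already know. The helper names `id_range_predicates` and `export_table_in_parallel` are hypothetical; the `predicates` argument of `spark.read.jdbc` is part of PySpark, and each predicate becomes one partition read over its own connection.

```python
def id_range_predicates(min_id, max_id, num_partitions, column="id"):
    """Split [min_id, max_id] into contiguous WHERE-clause ranges.

    Hypothetical helper: `column` is assumed to be a numeric, indexed key.
    """
    total = max_id - min_id + 1
    step = -(-total // num_partitions)  # ceiling division
    predicates = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step - 1, max_id)
        predicates.append(f"{column} >= {lo} AND {column} <= {hi}")
        lo = hi + 1
    return predicates


def export_table_in_parallel(url, table, out_path, predicates, user, password):
    """Read one JDBC partition per predicate and write the result as Parquet."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.jdbc(url, table, predicates=predicates,
                         properties={"user": user, "password": password})
    df.write.parquet(out_path)
```

A call might look like `export_table_in_parallel("YOUR_MYSQL_JDBC_CONN_STRING", "YOUR_TABLE", "YOUR_HDFS_FILE", id_range_predicates(1, 1_000_000, 16), "YOUR_USER", "YOUR_PASSWORD")`, where the id range and partition count are placeholders. `spark.read.jdbc` also accepts `column`, `lowerBound`, `upperBound` and `numPartitions`, which let Spark compute similar ranges itself.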

The odbc2parquet command line tool might also be useful in some situations.

# -vvv: log output, good to know it is still doing something during large downloads
# query: subcommand for accessing data and storing it
# --batch-size: batch size in rows
# --batches-per-file: omit to store the entire query result in a single file
# out.par: path to the output Parquet file
odbc2parquet \
-vvv \
query \
--connection-string "${ODBC_CONNECTION_STRING}" \
--batch-size 100000 \
--batches-per-file 100 \
out.par \
"SELECT * FROM YourTable"
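
The same batch-by-batch idea can be sketched in plain Python if neither Spark nor ODBC is available. This is only an illustration, not how odbc2parquet is implemented: it assumes a DB-API 2.0 connection (the stdlib `sqlite3` module stands in for your database driver) and the `pyarrow` package for writing Parquet. The function names `iter_batches` and `write_query_to_parquet` are hypothetical.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield query results as lists of row dicts, at most batch_size rows each."""
    columns = [d[0] for d in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [dict(zip(columns, row)) for row in rows]


def write_query_to_parquet(conn, query, out_path, batch_size=100_000):
    """Stream a query into one Parquet file without holding it all in memory."""
    # Imported lazily; pyarrow is a third-party dependency (assumption).
    import pyarrow as pa
    import pyarrow.parquet as pq

    cursor = conn.cursor()
    cursor.execute(query)
    writer = None
    schema = None
    try:
        for batch in iter_batches(cursor, batch_size):
            # The schema is inferred from the first batch; a column that is
            # entirely NULL there would need an explicit schema instead.
            table = pa.Table.from_pylist(batch, schema=schema)
            if writer is None:
                schema = table.schema
                writer = pq.ParquetWriter(out_path, schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()


# Small in-memory demonstration of the batching step only.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
demo.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(5)])
cur = demo.cursor()
cur.execute("SELECT * FROM t")
batch_sizes = [len(b) for b in iter_batches(cur, 2)]
```

Each batch becomes one row group in the output file, so `batch_size` trades memory use against row-group size, much like `--batch-size` in the command above.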