How can I add a datetime stamp to the zip file when unloading data from Snowflake to S3?
I want to be able to add a timestamp to the file name I am writing out to S3. So far I have been able to write files to AWS S3 using the example below. Can someone guide me on how to add a datetime stamp to the file name?
copy into @s3bucket/something.csv.gz
from (select * from mytable)
file_format = (type=csv FIELD_OPTIONALLY_ENCLOSED_BY = '"' compression='gzip' )
single=true
header=TRUE;
Thanks in advance.
The syntax for defining a path within the stage or location portion of a COPY INTO statement does not allow functions to define it dynamically in SQL.
However, you can use a stored procedure to accomplish building dynamic queries, using JavaScript Date APIs and some string formatting.
Here's a very simple example for your use case, with some code adapted from another question:
CREATE OR REPLACE PROCEDURE COPY_INTO_PROCEDURE_EXAMPLE()
RETURNS VARIANT
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
  var rows = [];

  var n = new Date();
  // May need refinement to zero-pad some values or achieve a specific format
  var datetime = `${n.getFullYear()}-${n.getMonth() + 1}-${n.getDate()}-${n.getHours()}-${n.getMinutes()}-${n.getSeconds()}`;

  var st = snowflake.createStatement({
    sqlText: `COPY INTO '@s3bucket/${datetime}_something.csv.gz' FROM (SELECT * FROM mytable) FILE_FORMAT=(TYPE=CSV FIELD_OPTIONALLY_ENCLOSED_BY='"' COMPRESSION='gzip') SINGLE=TRUE HEADER=TRUE;`
  });

  var result = st.execute();
  result.next();
  rows.push(result.getColumnValue(1));

  return rows;
$$;
To execute it, run:
CALL COPY_INTO_PROCEDURE_EXAMPLE();
The above lacks polished date-format handling (zero-padding the month, day, hour, minute, and second), error handling (if the COPY INTO fails), parameterization of the input query, and so on, but it should give a general idea of how to accomplish this.
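For instance, the zero-padding could look something like this inside the procedure body (a minimal sketch; the pad helper is illustrative and not part of the answer above):
// Illustrative helper: left-pad single-digit date parts with a zero so file names sort correctly
function pad(v) { return (v < 10 ? '0' : '') + v; }
var n = new Date();
// Produces e.g. 2020-03-07-09-05-01 rather than 2020-3-7-9-5-1
var datetime = `${n.getFullYear()}-${pad(n.getMonth() + 1)}-${pad(n.getDate())}-${pad(n.getHours())}-${pad(n.getMinutes())}-${pad(n.getSeconds())}`;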
Snowflake doesn't support this feature yet, but it will soon.
As Sharvan Kumar suggests above, Snowflake now supports this:
-- Partition the unloaded data by date and hour. Set 32000000 (32 MB) as the upper size limit of each file to be generated in parallel per thread.
copy into @%t1
from t1
partition by ('date=' || to_varchar(dt, 'YYYY-MM-DD') || '/hour=' || to_varchar(date_part(hour, ts))) -- Concatenate labels and column values to output meaningful filenames
file_format = (type=parquet)
max_file_size = 32000000
header=true;
list @%t1
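Applied back to the original question, a sketch could look roughly like the following (an assumed, untested adaptation: unload_date is a column added in the SELECT purely so the partition expression has something to reference, and SINGLE=TRUE is dropped because partitioning writes one or more files per partition value):
-- Sketch only: embed the current date in the unloaded file path via PARTITION BY
copy into @s3bucket/
from (select current_date as unload_date, * from mytable)  -- unload_date is an added, illustrative column
partition by ('dt=' || to_varchar(unload_date, 'YYYY-MM-DD'))
file_format = (type=csv FIELD_OPTIONALLY_ENCLOSED_BY = '"' compression='gzip')
header=TRUE;
Note that the partition expression becomes a prefix on the generated file paths (e.g. dt=2020-03-07/...) rather than renaming a single something.csv.gz.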