Snowflake: can an SQS-SNS provide granular path for COPY INTO?

Question

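I am loading data into Snowflake from an S3 folder that also contains many subfolders. Due to design constraints, I cannot change the folder structure or delete files after they are loaded. The ELT best practices I have read recommend loading data from granular paths like this: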

-- Simple method:  Scan the entire stage
copy into sales_table
  from @landing_data
  pattern='.*[.]csv';

-- Most Flexible method:  Limit within directory
copy into sales_table
  from @landing_data/sales/transactions/2020/05
  pattern='.*[.]csv';

-- Fastest method:  A named file
copy into sales_table
  from @landing_data/sales/transactions/2020/05/sales_050.csv;

However, as mentioned above, the most specific path I can use is @landing_data/sales/transactions, which grows by date, so performance will degrade over time. The guide to using an SNS topic says:

Note that the pipe will only copy files to the ingest queue triggered by event notifications via the SNS topic.

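I have a few questions:

If I understand correctly, it means that SNS will provide the path of that file for Snowpipe, which makes the loading process already use a granular path?
If the above is wrong, is there any way to make sure that performance does not degrade over time? I am not allowed to change the S3 structure or to delete files after loading.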

Answer 1

If I understand correctly, it means that SNS will provide the path of that file for Snowpipe, which makes the loading process already use a granular path?

Correct. From https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html#step-3-create-a-pipe-with-auto-ingest-enabled:

Data files are loaded in a stage.

An S3 event notification informs Snowpipe via an SQS queue that files are ready to load. Snowpipe copies the files into a queue.

A Snowflake-provided virtual warehouse loads data from the queued files into the target table based on parameters defined in the specified pipe.

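The key phrase is "loads data from the queued files", which is what you are looking for here. It means Snowpipe does not have to list the contents of the folder (listing is the main cause of the performance problems with non-granular paths).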
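As a rough sketch of what such a pipe could look like for the stage and table in the question: the pipe name sales_pipe and the '<sns_topic_arn>' placeholder are assumptions for illustration, and the example assumes the stage already defines a file format, as the COPY statements in the question seem to.

```sql
-- Hypothetical pipe name; replace '<sns_topic_arn>' with the ARN of your own SNS topic.
create pipe sales_pipe
  auto_ingest = true
  aws_sns_topic = '<sns_topic_arn>'
  as
  -- The path here only scopes which event notifications the pipe accepts;
  -- the files to load come from the notifications, not from a directory listing.
  copy into sales_table
    from @landing_data/sales/transactions;
```

With a pipe like this, each new S3 object is queued by name when its notification arrives, so the growing number of older files under the prefix should not slow down new loads.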

Note that you do not need Snowpipe for this: COPY INTO has a FILES option that lets you specify individual files.
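A minimal sketch of the FILES option, assuming you already know the names of the new files (for example from your own processing of the S3 event notifications). The second file name, sales_051.csv, is made up for illustration.

```sql
-- File names in FILES are relative to the stage path given in FROM.
copy into sales_table
  from @landing_data/sales/transactions/
  files = ('2020/05/sales_050.csv',   -- file from the question
           '2020/05/sales_051.csv');  -- hypothetical second file
```

FILES accepts at most 1,000 file names per statement, so a larger batch would have to be split across several COPY INTO statements.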

snowflake-cloud-data-platform