How to skip headers when reading data from a CSV file in S3 and creating a table in AWS Athena

Question

I'm trying to read CSV data from an S3 bucket and create a table in AWS Athena. The table I create does not skip the header line of my CSV file.

Example query:

CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  `event_type_id` string,
  `customer_id` string,
  `date` string,
  `email` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "|",
  "quoteChar" = "\""
)
LOCATION 's3://location/'
TBLPROPERTIES ("skip.header.line.count"="1");

I expected skip.header.line.count to take care of this, but it doesn't work. I think AWS has an issue with this. Is there another way to get around it?

Answer 1

This is a known deficiency.

The best workaround I've seen was tweeted by Eric Hammond:

...WHERE date NOT LIKE '#%'

This appears to skip header lines at query time. I'm not sure how it works, but it may be a way of skipping NULLs.
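To see what a filter like this does, here is a minimal local sketch in Python that uses the standard-library sqlite3 module as a stand-in for Athena. That is an assumption: Athena's engine is not SQLite, so treat this as an illustration of the SQL semantics only. The `logs` and `events` tables and all sample rows are hypothetical. It suggests the filter drops rows whose `date` value starts with `#`, and also rows where `date` is NULL, because a NULL operand makes the `NOT LIKE` predicate NULL rather than true. For a header row made of the column names themselves, as in the question, comparing a column against its own name is one possible variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical log-style table: a '#'-prefixed header line was loaded as data,
# landing in the first column (date) with the other column left NULL.
conn.execute("CREATE TABLE logs (date TEXT, uri TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [
        ("#Version: 1.0", None),         # header line read as a row
        ("2019-11-18", "/index.html"),
        (None, "/orphan"),               # row with a NULL date
        ("2019-11-19", "/about.html"),
    ],
)

# The tweeted filter: rows starting with '#' fail the predicate, and a NULL
# date makes the predicate NULL (not true), so that row is dropped as well.
kept = conn.execute(
    "SELECT date, uri FROM logs WHERE date NOT LIKE '#%' ORDER BY date"
).fetchall()

# Variant for a header row that repeats the column names (hypothetical data):
# compare a column against its own name.
conn.execute("CREATE TABLE events (event_type_id TEXT, customer_id TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("event_type_id", "customer_id"), ("1", "c-100"), ("2", "c-200")],
)
data_rows = conn.execute(
    "SELECT event_type_id, customer_id FROM events "
    "WHERE event_type_id <> 'event_type_id' ORDER BY event_type_id"
).fetchall()

print(kept)
print(data_rows)
```

Either filter has to be repeated in every query (or wrapped in a view), which is why a working skip.header.line.count is the cleaner option when it is available.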

Answer 2

This works in Redshift:

You want to use table properties ('skip.header.line.count'='1'), along with other properties if needed, e.g. 'numRows'='100'. Here's a sample:

create external table exreddb1.test_table
(ID BIGINT 
,NAME VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');

Answer 3

As of today (2019-11-18), the query from the OP appears to work. That is, skip.header.line.count is honored and the first line is indeed skipped.
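One way to sanity-check that the header really is skipped is to compare Athena's output with a local parse of the same file. Below is a minimal Python sketch, assuming a small pipe-delimited sample with `"` as the quote character (matching the SERDEPROPERTIES in the question). The sample data and the helper name `read_skipping_header` are hypothetical, and the code only emulates the expected behavior locally; it does not call Athena.

```python
import csv
import io


def read_skipping_header(text, skip=1, delimiter="|", quotechar='"'):
    """Parse delimited text and drop the first `skip` records."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    rows = list(reader)
    return rows[skip:]


# Hypothetical sample file contents: one header line, then two data lines.
sample = (
    'event_type_id|customer_id|date|email\n'
    '1|c-100|2019-11-18|"a@example.com"\n'
    '2|c-200|2019-11-19|"b|c@example.com"\n'
)

rows = read_skipping_header(sample)
print(rows)
```

If a `SELECT *` on the Athena table returns the same rows as this local parse, the header is being skipped as expected.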

csv

amazon-s3

amazon-web-services

amazon-athena