Unable to get AWS Athena escapeChar working
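I'm trying out AWS Athena, but I've run into a problem with a CSV file I'm testing with.
The escapeChar setting below doesn't seem to work.
I've tried using a crawler and specifying the escapeChar in the UI, with and without double quotes, but still no luck. When a row has the delimiter inside a string, it is read as a field separator even though it is escaped.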
DDL
CREATE EXTERNAL TABLE mytestcsvtable (
col_id string,
col_description string,
col_text string,
col_decimal string,
col_float string,
col_date string,
col_time string,
col_timestamp string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = '|',
'escapeChar' = '\'
)
STORED AS TEXTFILE
LOCATION 's3://mystaging2/tmp/';
Data
401|SingleBackslash|Goblin \Gobby Gruesome|.00|0E0|2020-03-02|21.21.43|2020-03-02 12.35.52.145894
402|EvenNumberOfSingleBackslash|Goblin \Gobby\ Gruesome|.00|0E0|2020-03-02|22.22.44|2020-03-02 12.35.52.156563
403|SingleSingleBackslash|Goblin \\Gobby\\ Gruesome|.00|0E0|2020-03-02|23.23.45|2020-03-02 12.35.52.158011
404|OddNumberOfSingleBackslash1|Goblin \Gruesome\ \Gruesome|.00|0E0|2020-03-02|00.24.46|2020-03-02 12.35.52.159835
405|OddNumberOfSingleBackslash2|Goblin \\Gobby\ Gruesome|.00|0E0|2020-03-02|01.25.47|2020-03-02 12.35.52.162538
406|OddNumberOfSingleSingleBackslash1|Goblin \\Gruesome\\ \\Gruesome|.00|0E0|2020-03-02|02.26.48|2020-03-02 12.35.52.163510
407|OddNumberOfSingleSingleBackslash2|Goblin \\Gobby\\\ \\Gruesome|.00|0E0|2020-03-02|03.27.49|2020-03-02 12.35.53.167322
408|SingleSingleSingleBackslash1|Goblin \\\Gobby|.00|0E0|2020-03-02|04.28.50|2020-03-02 12.35.53.179868
501|SinglePipe|Goblin \|Gobby Gruesome|.00|0E0|2020-03-02|05.29.51|2020-03-02 12.35.53.180025
502|EvenNumberOfSinglePipe|Goblin \|Gobby\| Gruesome|.00|0E0|2020-03-02|06.30.52|2020-03-02 12.35.53.184042
503|SingleSinglePipe|Goblin \|\|Gobby\|\| Gruesome|.00|0E0|2020-03-02|07.31.53|2020-03-02 12.35.53.189979
504|OddNumberOfSinglePipe1|Goblin \|Gruesome\| \|Gruesome|.00|0E0|2020-03-02|08.32.54|2020-03-02 12.35.53.194734
505|OddNumberOfSinglePipe2|Goblin \|\|Gobby\| Gruesome|.00|0E0|2020-03-02|09.33.55|2020-03-02 12.35.53.196996
506|OddNumberOfSingleSinglePipe1|Goblin \|\|Gruesome\|\| \|\|Gruesome|.00|0E0|2020-03-02|10.34.56|2020-03-02 12.35.53.203568
507|OddNumberOfSingleSinglePipe2|Goblin \|\|Gobby\|\|\| \|\|Gruesome|.00|0E0|2020-03-02|11.35.57|2020-03-02 12.35.53.203999
508|SingleSingleSinglePipe1|Goblin \|\|\|Gobby|.00|0E0|2020-03-02|12.36.58|2020-03-02 12.35.54.208965
Thanks for taking the time to read my post!
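In CSV, escapeChar is used to escape a quoteChar that may appear inside a quoted field, not to escape the separator.
To "escape" a delimiter inside a field, the field itself must be quoted. For example: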
501|SinglePipe|Goblin \|Gobby Gruesome|.00|0E0|2020-03-02
becomes:
501|SinglePipe|"Goblin \|Gobby Gruesome"|.00|0E0|2020-03-02
or
501|SinglePipe|"Goblin |Gobby Gruesome"|.00|0E0|2020-03-02
If the field contains a quote character, that quote must be escaped:
501|SinglePipe|"Goblin|24\" monitor|Gobby Gruesome"|.00|0E0|2020-03-02
Also, make sure to tell the parser which quoteChar to use if it is not the default.
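As a rough sketch of that last point, the table definition from the question could name the quote character explicitly. This assumes the data files have already been rewritten so that any field containing | is wrapped in double quotes. The table name, columns, and location are taken from the question, and the backslash is written doubled here, as in the OpenCSVSerde example in the Athena documentation. Treat it as a starting point and not a verified fix.

```sql
-- Sketch only: assumes fields containing '|' are double-quoted in the data files.
CREATE EXTERNAL TABLE mytestcsvtable (
  col_id string,
  col_description string,
  col_text string,
  col_decimal string,
  col_float string,
  col_date string,
  col_time string,
  col_timestamp string
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = '|',   -- field delimiter
  'quoteChar'     = '"',   -- wraps fields that contain the delimiter
  'escapeChar'    = '\\'   -- escapes a quoteChar inside a quoted field
)
STORED AS TEXTFILE
LOCATION 's3://mystaging2/tmp/';
```

With this definition, a row such as 501|SinglePipe|"Goblin |Gobby Gruesome"|.00|0E0|2020-03-02|... should keep the pipe inside col_text instead of splitting the field.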