Athena not able to read multi-line text in CSV fields

This Athena table correctly reads the first line of the file.

CREATE EXTERNAL TABLE `test_delete_email5`(
`col1` string, 
`col2` string, 
`col3` string, 
`col4` string,
`col5` string,
`col6` string,  
`col7` string,  
`col8` string,  
`col9` string,  
`col10` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'LINES TERMINATED BY' = '\n',
'ESCAPED BY' = '\',
'quoteChar'     = '\"'
) LOCATION 's3://testme162/email_backup/email5/'
TBLPROPERTIES ('has_encrypted_data'='false')

Because column 5 contains HTML code, the table is not imported correctly. Is there any other way to do this?

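Your file appears to contain a lot of multi-line text in the textbody field. That does not match the CSV layout OpenCSVSerde expects (or at least, OpenCSVSerde cannot parse it).

As a test, I made a simple file: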

"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"
"one","two","three","four","five","six","seven","eight","nine","ten"
"one","two","three","four","five \" quote \" five2","six","seven","eight","nine","ten"
"one","two","three","four","five \
five2","six","seven","eight","nine","ten"
  • Line 1 is the header
  • Line 2 is normal
  • Line 3 has a field containing \" escaped quotes
  • Line 4 has an escaped newline

I then ran the command from your question and pointed it at this data file.

Result:

  • Lines 1-3 were returned (including the header row)
  • Line 4 only worked up to the \; the data after it was lost
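
One way to see why the fourth record breaks is to compare a newline-based split with a parser that treats the backslash as an escape character. The following is a minimal Python sketch, not Athena's or OpenCSVSerde's actual code; it reuses the test file above and uses Python's standard csv module purely for illustration.

```python
import csv
import io

# The test file from above, reproduced verbatim.
# A raw string keeps every backslash exactly as it appears in the file.
data = r'''"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"
"one","two","three","four","five","six","seven","eight","nine","ten"
"one","two","three","four","five \" quote \" five2","six","seven","eight","nine","ten"
"one","two","three","four","five \
five2","six","seven","eight","nine","ten"
'''

# A line-oriented reader treats every newline as a record boundary,
# so the fourth record is cut in two at the backslash.
physical_lines = data.splitlines()
print(len(physical_lines))

# A CSV-aware parser that is told backslash is the escape character
# keeps the escaped newline inside the quoted field instead.
reader = csv.reader(
    io.StringIO(data, newline=""),
    quotechar='"',
    escapechar="\\",
    doublequote=False,
)
records = list(reader)
print(len(records))
print(repr(records[3][4]))
```

The first count is the number of physical lines and the second is the number of logical records. The gap between them comes from the record that a newline-based reader cannot reassemble.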

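Bottom line: your file's format is not compatible with the CSV format that OpenCSVSerde can read.

You might be able to find a Serde that can handle it, but OpenCSVSerde does not seem to understand it, because rows are normally separated by newlines.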
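
If no suitable Serde turns up, one possible workaround is to rewrite the file before it reaches S3 so that every record fits on a single line. The sketch below is only an illustration under assumptions: the helper name flatten_multiline_fields is hypothetical, the input is assumed to use backslash escapes as in the test file, and replacing embedded newlines with spaces is assumed to be acceptable for the data (it changes the text and HTML bodies).

```python
import csv
import io


def flatten_multiline_fields(text: str) -> str:
    """Return CSV text in which no field contains a line break.

    Hypothetical helper: assumes double-quoted fields with backslash escapes.
    """
    reader = csv.reader(
        io.StringIO(text, newline=""),
        quotechar='"',
        escapechar="\\",
        doublequote=False,
    )
    out = io.StringIO()
    writer = csv.writer(
        out,
        quoting=csv.QUOTE_ALL,   # quote every field, like the input file
        quotechar='"',
        escapechar="\\",         # write embedded quotes as \"
        doublequote=False,
        lineterminator="\n",
    )
    for row in reader:
        # Replace embedded line breaks so one record equals one physical line.
        writer.writerow(
            [field.replace("\r\n", " ").replace("\n", " ") for field in row]
        )
    return out.getvalue()


# Small sample with one escaped newline inside a quoted field.
sample = r'''"id","textbody"
"1","five \
five2"
'''

flattened = flatten_multiline_fields(sample)
print(flattened)
```

After this step each record occupies exactly one line, which is the layout a newline-delimited reader expects. Whether the result then loads cleanly in Athena would still need to be checked against the real data.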