Athena not able to read multi-line text in CSV fields
This Athena table correctly reads the first line of the file.
CREATE EXTERNAL TABLE `test_delete_email5`(
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string,
`col6` string,
`col7` string,
`col8` string,
`col9` string,
`col10` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'LINES TERMINATED BY' = '\n',
'ESCAPED BY' = '\',
'quoteChar' = '\"'
) LOCATION 's3://testme162/email_backup/email5/'
TBLPROPERTIES ('has_encrypted_data'='false')
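Because of the HTML code found in column 5, the table is not imported correctly. Is there another way to do this?
Your file appears to contain a lot of multi-line text in the textbody field. This is not standard CSV (or at least, it cannot be understood by the OpenCSVSerde).
As a test, I made a simple file: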
"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"
"one","two","three","four","five","six","seven","eight","nine","ten"
"one","two","three","four","five \" quote \" five2","six","seven","eight","nine","ten"
"one","two","three","four","five \
five2","six","seven","eight","nine","ten"
- Line 1 is the header
- Line 2 is normal
- Line 3 has a field containing \" escaped quotes
- Line 4 has an escaped newline
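I then ran the command from your question and pointed it at this data file.
Result:
- Lines 1-3 were returned (including the header line)
- Line 4 only worked up to the \ and the data after it was lost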
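To illustrate why line 4 gets cut off, here is a small Python sketch (not Athena itself) that compares the number of physical lines in the test file with the number of logical records a quote-aware CSV parser sees. It assumes the backslash is the escape character, as in the table definition; the variable names are only for illustration.

```python
import csv
import io

# The five physical lines of the test file shown above.
data = (
    '"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"\n'
    '"one","two","three","four","five","six","seven","eight","nine","ten"\n'
    '"one","two","three","four","five \\" quote \\" five2","six","seven","eight","nine","ten"\n'
    '"one","two","three","four","five \\\n'
    'five2","six","seven","eight","nine","ten"\n'
)

# A reader that splits on newlines first sees five separate pieces.
physical_lines = data.splitlines()

# A quote-aware parser treats the escaped newline as part of the field.
logical_records = list(
    csv.reader(io.StringIO(data), quotechar='"', escapechar="\\", doublequote=False)
)

print(len(physical_lines), len(logical_records))
```

The two counts differ because the last record spans two physical lines. A reader that cuts the input at every newline before parsing only ever sees the first half of that record.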
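Bottom line: Your file format is not compatible with the CSV format that is expected here.
You might be able to find a Serde that can handle it, but the OpenCSVSerde does not seem to understand it, because rows are normally separated by newlines.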
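One possible workaround, if you can preprocess the files before they land in S3, is to flatten the line breaks inside fields so that every record sits on a single line. The following is a minimal sketch using Python's csv module. The function name `flatten_multiline_csv` and the choice of a space as the replacement are assumptions for illustration, and it assumes backslash-escaped quotes as in the test file.

```python
import csv
import io


def flatten_multiline_csv(src, dst, placeholder=" "):
    """Copy CSV records from src to dst, replacing line breaks inside fields.

    src and dst are text file objects. Returns the number of records written.
    """
    reader = csv.reader(src, quotechar='"', escapechar="\\", doublequote=False)
    writer = csv.writer(
        dst,
        quotechar='"',
        escapechar="\\",
        doublequote=False,
        quoting=csv.QUOTE_ALL,
        lineterminator="\n",
    )
    count = 0
    for row in reader:
        cleaned = [
            field.replace("\r\n", placeholder)
            .replace("\n", placeholder)
            .replace("\r", placeholder)
            for field in row
        ]
        writer.writerow(cleaned)
        count += 1
    return count


# Two records from the test file: escaped quotes, then an escaped newline.
sample = (
    '"one","two","three","four","five \\" quote \\" five2","six","seven","eight","nine","ten"\n'
    '"one","two","three","four","five \\\n'
    'five2","six","seven","eight","nine","ten"\n'
)
out = io.StringIO()
rows_written = flatten_multiline_csv(io.StringIO(sample), out)
print(out.getvalue())
```

For real files you would open them with `open(path, newline="")` instead of `io.StringIO`. Note that this loses the original line breaks in the text; if they matter, a different placeholder or a different storage format may be a better fit.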