How to handle a delimiter inside a column value?

Question

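I am trying to load CSV file data into my Hive table, but the values of one column contain the delimiter (,), so Hive treats it as a field separator and loads the rest of the value into a new column. I tried escaping it with \ but that does not work either: the data after the , is always loaded into a new column.

My CSV file: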

        id,name,desc,per1,roll,age
        226,a1,"\"double bars","item1 and item2\"",0.0,10,25
        227,a2,"\"doubles","item2 & item3 item4\"",0.1,20,35
        228,a3,"\"double","item3 & item4 item5\"",0.2,30,45
        229,a4,"\"double","item5 & item6 item7\"",0.3,40,55

I updated my table:

    create table testing(id int, name string, desc string, uqc double, roll int, age int) 
    ROW   FORMAT SERDE 
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
     WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = '"',
    "escapeChar" = "\\" ) STORED AS textfile;

But I still get the data after the , in a separate column.

I am using the LOAD DATA INPATH command.

Answer 1

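Here is how to create the table based on RegexSerDe.

Each column should have a corresponding capturing group () in the regular expression. You can easily debug the regex without creating a table by using regexp_replace: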

select regexp_replace('226,a1,"\"double bars","item1 and item2\"",0.0,10,25',
                      '^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*', --6 groups
                     '$1 $2 $3 $4 $5 $6'); --space delimited fields

Result:

226 a1 "double bars","item1 and item2" 0.0 10 25
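To check a single capturing group rather than the whole line, regexp_extract can be used in the same way. The following is only a sketch: the sample line and the pattern are copied from the query above, while the subquery alias t and the column aliases (id, desc_value, uqc) are illustrative names, not part of the original answer.

```sql
-- Inspect individual capture groups of the same pattern.
-- The backslash is doubled (\\d) because Hive unescapes string literals
-- before the regex engine sees them.
select regexp_extract(line, pattern, 1) as id,         -- group 1: leading digits
       regexp_extract(line, pattern, 3) as desc_value, -- group 3: the quoted description
       regexp_extract(line, pattern, 4) as uqc         -- group 4: the decimal value
from (
  select '226,a1,"\"double bars","item1 and item2\"",0.0,10,25' as line,
         '^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*' as pattern
) t;
```

Each extracted value should correspond to the matching field in the space-delimited result shown above; if a group comes back empty, that group is the part of the pattern to revisit.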

If it looks good, create the table:

 create external table testing(id int, 
                      name string, 
                      desc string, 
                      uqc double, 
                      roll int, 
                      age int
                     ) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ('input.regex'='^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')
location ....
TBLPROPERTIES("skip.header.line.count"="1")
;
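
Once the table exists, a quick sanity check can show whether every line of the file matched the pattern. This is a sketch under one assumption worth stating: RegexSerDe is documented to return NULL for all columns of a row that does not match input.regex, so rows with a NULL id are treated here as unmatched lines. The alias unmatched_rows is illustrative.

```sql
-- Look at a few parsed rows.
select * from testing limit 10;

-- Count rows where the regex did not match (all columns come back NULL).
select count(*) as unmatched_rows
from testing
where id is null;
```

A non-zero count suggests that some lines have a shape the pattern does not cover, for example a missing quoted field or extra trailing columns.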

Read this article for more details.
