Handling data with and without double quotation marks in Hive

Can someone guide me on how to load data into Hive when, for the same column, some rows have the value enclosed in double quotes (") and other rows do not?

    Sample Data:

    id,name,desc,uqc,roll,age
    1,Monali,"abhc,jkjk",,23,23
    2,mj,nhiijkla,67,23,60
    7,jena,"kdjuu,hsysi,juw",3,34,23
    1,Monali,"/"coppers bars","rods and profiles"/",,23,23
    2,money,"/"COUPLING","FLANGES & CROSS OVER"/",67,23,60

In the data above, the row with id '2' (name mj) has no " around its desc column value.

My create statement:

    create external table testing(id int, 
                  name string, 
                  desc string, 
                  uqc double, 
                  roll int, 
                  age int
                 ) 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
    WITH SERDEPROPERTIES ('input.regex'='^(\\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')
    location ....
    TBLPROPERTIES("skip.header.line.count"="1")
    ;

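I do not get any error while loading the data, but when I run select * from testing, the select statement does not execute. The create and select statements above work fine if every desc value is enclosed in ", but not when the data contains values both with and without ".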

Try this SerDe:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
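
A minimal sketch of what a table using this SerDe might look like, assuming the same six columns as in the question. The table name testing_csv and the location path are placeholders made up for illustration, not values from the question. OpenCSVSerde exposes every column as string, so the sketch declares strings and casts when reading. I have not checked how it parses rows such as "/"coppers bars","rods and profiles"/", so inspect those rows before relying on it.

```sql
-- Sketch only: testing_csv and the LOCATION path are hypothetical placeholders.
create external table testing_csv(
    id string,
    name string,
    `desc` string,
    uqc string,
    roll string,
    age string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',   -- field delimiter (also the default)
    'quoteChar'     = '"'    -- quoting character (also the default)
)
STORED AS TEXTFILE
LOCATION '/path/to/your/data'   -- placeholder: use your own directory
TBLPROPERTIES("skip.header.line.count"="1");

-- OpenCSVSerde returns strings, so cast to the types you need when reading.
select cast(id as int)     as id,
       name,
       `desc`,
       cast(uqc as double) as uqc,
       cast(roll as int)   as roll,
       cast(age as int)    as age
from testing_csv;
```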

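Currently the third group in the regex is enclosed in quotes, so the quotes are mandatory. Try making the quotes optional: "? means zero or one quote. Also make the group content non-greedy, (.*?), so that it does not capture extra quotes inside the group: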

'input.regex'='^(\\d+?),(.*?),"?(.*?)"?,([0-9.]*),(\\d*),(\\d*).*'

Here is a test with regexp_replace on your sample data. I also added optional slashes around the third group so that they are removed from the output:

with mytable as (
select stack(6,
    '1,Monali,"abhc,jkjk",,23,23',
    '2,mj,nhiijkla,67,23,60',
    '7,jena,"kdjuu,hsysi,juw",3,34,23',
    '1,Monali,"/"coppers bars","rods and profiles"/",,23,23',
    '2,money,"/"COUPLING","FLANGES & CROSS OVER"/",67,23,60',
    '2,money,"17",19"LCD PANEL FOR COMPUTER",67,23,60'
) as initial_data
)

-- $1..$6 are the six capturing groups; backslashes are doubled because Hive unescapes string literals
select regexp_replace(initial_data,'^(\\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\\d*),(\\d*).*',
                                   '$1 || $2 || $3 || $4 || $5 || $6'
                     ) as parsed_result
 from mytable

Result (groups are delimited by two pipes and spaces, ' || '):

parsed_result
1 || Monali || abhc,jkjk || || 23 || 23
2 || mj || nhiijkla || 67 || 23 || 60
7 || jena || kdjuu,hsysi,juw || 3 || 34 || 23
1 || Monali || "coppers bars","rods and profiles" || || 23 || 23
2 || money || "COUPLING","FLANGES & CROSS OVER" || 67 || 23 || 60
2 || money || 17",19"LCD PANEL FOR COMPUTER || 67 || 23 || 60

So, if the result looks good, use this regex in the table DDL:

'input.regex'='^(\\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\\d*),(\\d*).*'
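
For reference, a sketch of how the full DDL from the question might look with this regex in place. The location path is a placeholder, since the question elides it. The backslashes are written doubled because, as far as I know, Hive unescapes string literals and a single \d would reach the SerDe as a plain d.

```sql
-- Sketch only: the LOCATION path is a hypothetical placeholder.
create external table testing(
    id int,
    name string,
    `desc` string,
    uqc double,
    roll int,
    age int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    -- quotes and slashes around the third group are optional; the group itself is non-greedy
    'input.regex'='^(\\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\\d*),(\\d*).*'
)
LOCATION '/path/to/your/data'   -- placeholder: use your own directory
TBLPROPERTIES("skip.header.line.count"="1");
```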

Test it carefully on the whole dataset and check for empty/NULL values; fix the regex if necessary.
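
One way such a check could look, as a rough sketch against the testing table from the question. It relies on RegexSerDe returning NULL in every column for a row that does not match the regex, so a NULL id is a reasonable signal of an unparsed line. The column aliases are made up for illustration.

```sql
-- Count rows that failed to parse or came back with an empty description.
select count(*)                                                          as total_rows,
       sum(case when id is null then 1 else 0 end)                       as rows_with_null_id,
       sum(case when `desc` is null or `desc` = '' then 1 else 0 end)    as rows_with_empty_desc
from testing;

-- Look at a sample of the rows that did not match.
select *
from testing
where id is null
limit 20;
```

If rows_with_null_id is greater than zero, compare the raw lines for those rows against the regex and adjust it.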