Handling data with and without double quotation marks in Hive
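Can someone guide me on how to load this data into Hive? In some rows a column value is wrapped in double quotes ("), but in other rows the value of the same column has no quotes.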
Sample Data:
id,name,desc,uqc,roll,age
1,Monali,"abhc,jkjk",,23,23
2,mj,nhiijkla,67,23,60
7,jena,"kdjuu,hsysi,juw",3,34,23
1,Monali,"/"coppers bars","rods and profiles"/",,23,23
2,money,"/"COUPLING","FLANGES & CROSS OVER"/",67,23,60
In the data above, the desc value for id '2' (the mj row) has no double quotes.
My create statement:
create external table testing(id int,
name string,
desc string,
uqc double,
roll int,
age int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')
location ....
TBLPROPERTIES("skip.header.line.count"="1")
;
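I do not get any error while loading the data, but when I run select * from testing, the select statement does not execute. The create and select statements above work fine if every desc value is quoted, but they do not work when the data mixes quoted and unquoted values.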
Try this SerDe:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
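Currently the third group in your regex is wrapped in quotes, so the quotes are mandatory. Try making them optional with "? (zero or one quote), and make the group content non-greedy with (.*?) so that it does not capture extra quotes inside the group: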
'input.regex'='^(\d+?),(.*?),"?(.*?)"?,([0-9.]*),(\d*),(\d*).*'
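To see why the mandatory quotes matter, the two patterns can be compared outside Hive. The following is only a sketch: it uses Python's re module as a stand-in for the Java regex engine that Hive uses, on the assumption that these simple constructs behave the same in both. The names ORIGINAL and RELAXED are illustrative.

```python
import re

# Pattern from the question: quotes around the third group are mandatory.
ORIGINAL = re.compile(r'^(\d+?),(.*?),"(.*)",([0-9.]*),([0-9]*),([0-9]*).*')
# Pattern suggested above: quotes optional, third group non-greedy.
RELAXED = re.compile(r'^(\d+?),(.*?),"?(.*?)"?,([0-9.]*),(\d*),(\d*).*')

rows = [
    '1,Monali,"abhc,jkjk",,23,23',
    '2,mj,nhiijkla,67,23,60',
    '7,jena,"kdjuu,hsysi,juw",3,34,23',
]

for row in rows:
    old = ORIGINAL.match(row)
    new = RELAXED.match(row)
    print(row)
    print('  original:', old.groups() if old else 'no match')
    print('  relaxed: ', new.groups() if new else 'no match')
```

The unquoted row (id 2) is the one the original pattern rejects, while the relaxed pattern splits all three rows into six fields.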
Here is a test with regexp_replace on your data example. I also added optional slashes around the third group to remove them from the output:
with mytable as (
select stack(6,
'1,Monali,"abhc,jkjk",,23,23',
'2,mj,nhiijkla,67,23,60',
'7,jena,"kdjuu,hsysi,juw",3,34,23',
'1,Monali,"/"coppers bars","rods and profiles"/",,23,23',
'2,money,"/"COUPLING","FLANGES & CROSS OVER"/",67,23,60',
'2,money,"17",19"LCD PANEL FOR COMPUTER",67,23,60'
) as initial_data
)
select regexp_replace(initial_data,'^(\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\d*),(\d*).*',
'$1 || $2 || $3 || $4 || $5 || $6'
) as parsed_result
from mytable
Result (fields delimited by two pipes and spaces, ' || '):
parsed_result
1 || Monali || abhc,jkjk || || 23 || 23
2 || mj || nhiijkla || 67 || 23 || 60
7 || jena || kdjuu,hsysi,juw || 3 || 34 || 23
1 || Monali || "coppers bars","rods and profiles" || || 23 || 23
2 || money || "COUPLING","FLANGES & CROSS OVER" || 67 || 23 || 60
2 || money || 17",19"LCD PANEL FOR COMPUTER || 67 || 23 || 60
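The same check can be repeated without a Hive session. This is a rough sketch that again assumes Python's re module behaves like Hive's Java regex engine for this pattern; parse is a hypothetical helper name, not part of Hive.

```python
import re

# Final pattern from the answer: optional quotes and optional slashes
# around the third group.
PATTERN = re.compile(r'^(\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\d*),(\d*).*')

rows = [
    '1,Monali,"abhc,jkjk",,23,23',
    '2,mj,nhiijkla,67,23,60',
    '7,jena,"kdjuu,hsysi,juw",3,34,23',
    '1,Monali,"/"coppers bars","rods and profiles"/",,23,23',
    '2,money,"/"COUPLING","FLANGES & CROSS OVER"/",67,23,60',
    '2,money,"17",19"LCD PANEL FOR COMPUTER",67,23,60',
]

def parse(row):
    """Return the six captured fields, or None if the row does not match."""
    m = PATTERN.match(row)
    return list(m.groups()) if m else None

for row in rows:
    fields = parse(row)
    print(' || '.join(fields) if fields else 'NO MATCH: ' + row)
```

Joining the captured groups with ' || ' mirrors the regexp_replace call; an empty uqc field shows up as two adjacent delimiters.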
So, if the result looks good, use this regex in the table DDL:
'input.regex'='^(\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\d*),(\d*).*'
Test it carefully on the whole dataset and check for empty/NULL values; fix the regex if necessary.
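One way to do that check before loading is to scan an extract of the file and list the lines the pattern rejects, since, as far as I know, RegexSerDe returns NULL for every column of a row that does not match. The sketch below is hypothetical: check_lines and the sample list are illustrative names, and Python's re module stands in for Hive's Java regex engine.

```python
import re

PATTERN = re.compile(r'^(\d+?),(.*?),"?/?(.*?)/?"?,([0-9.]*),(\d*),(\d*).*')

def check_lines(lines):
    """Return (unmatched, empty_fields).

    unmatched:    lines the pattern rejects entirely.
    empty_fields: (line, group_number) pairs where a captured group is empty.
    """
    unmatched, empty_fields = [], []
    for line in lines:
        m = PATTERN.match(line)
        if m is None:
            unmatched.append(line)
            continue
        for i, value in enumerate(m.groups(), start=1):
            if value == '':
                empty_fields.append((line, i))
    return unmatched, empty_fields

# Illustrative input; in practice read the lines from a sample of the data file.
sample = [
    '1,Monali,"abhc,jkjk",,23,23',
    '2,mj,nhiijkla,67,23,60',
    'not a data row',
]
unmatched, empty_fields = check_lines(sample)
print('unmatched:', unmatched)
print('empty fields:', empty_fields)
```

Lines reported as unmatched are the ones to inspect first when adjusting the regex.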