How to remove duplicates from posexplode in Hive

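I am using posexplode. How can I remove the duplicate rows that result from parsing the column? I cannot use distinct because several columns are empty (before parsing).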

Example before posexplode:

id   | cofe         |
AAA  |  |||-9000| 4 |
BBB  |   5|90       |
CCC  |              |
DDD  |  6|||||      |
EEE  |              |

Unfortunately, the result is:

id   | cofe
AAA  |  
AAA  |  
AAA  |  -9000
AAA  |    4
BBB  |    5
BBB  |   90
CCC  |   
DDD  |   6
DDD  |   
DDD  |   
DDD  |   
DDD  |   
EEE  |

Expected result:

id   | cofe
AAA  |  -9000
AAA  |    4
BBB  |    5
BBB  |   90
CCC  |   
DDD  |   6 
EEE  |

SELECT qq.id,
       s1.cofe,
       s2.fnte,
       s3.cnte
FROM
(
  SELECT id,
         sequence,
         split(BMWA, '~')[15] AS CFEEA,
         split(BMWA, '~')[16] AS FTAAA,
         split(BMWA, '~')[17] AS CNTTA
  FROM
  (
    SELECT id,
           sequence,
           replace(bmw, '^', '~') AS BMWA
    FROM tablee
  ) rr
) qq
LATERAL VIEW posexplode(split(replace(qq.CFEEA, '|', '~'), '~')) s1 AS r1, cofe
LATERAL VIEW posexplode(split(replace(qq.FTAAA, '|', '~'), '~')) s2 AS r2, fnte
LATERAL VIEW posexplode(split(replace(qq.CNTTA, '|', '~'), '~')) s3 AS r3, cnte

Any ideas would be greatly appreciated!

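If you want to skip empty elements when splitting the string, replace each run of consecutive delimiters with a single delimiter before splitting, and also remove the leading and trailing delimiters.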

For example, with '|||-9000| 4' (pipe-delimited):

select  split(
        regexp_replace(
        --replace 2+ consecutive delimiters with a single one
        regexp_replace ('|||-9000| 4','\\|{2,}','|'), --gives '|-9000| 4'
        --remove the leading and trailing delimiter
        '^\\||\\|$',''),                              --gives '-9000| 4'
        --split on the delimiter
        '\\|')                                        --gives array ["-9000"," 4"]

Applied to your sample data:

with mytable as (
select stack (5,
'AAA','|||-9000| 4',
'BBB',' 5|90',
'CCC','',
'DDD','6|||||',
'EEE',''
) as (id,cofe )
)

select id, e.val as cofe
  from mytable
       --outer keeps one row for ids whose value is empty
       lateral view outer posexplode(
       split(
        regexp_replace(
        regexp_replace (cofe,'\\|{2,}','|'),
        '^\\||\\|$',''),
        '\\|')
      ) e as pos, val

Result:

    id  cofe    
   AAA    -9000
   AAA    4
   BBB    5
   BBB    90
   CCC
   DDD    6
   EEE
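
Note that the exploded values keep any spaces that surrounded them in the source string (' 4' and ' 5' in the sample data). If that matters, one possible variation, sketched here on a shortened copy of the same sample data, wraps the value in trim(). The backslashes are doubled on the assumption that the query runs in Hive, where a backslash inside a string literal acts as an escape character.

```sql
with mytable as (
select stack (2,
'AAA','|||-9000| 4',
'BBB',' 5|90'
) as (id,cofe )
)

select id, trim(e.val) as cofe  --strips leading and trailing spaces from each element
  from mytable
       lateral view outer posexplode(
       split(
        regexp_replace(
        regexp_replace (cofe,'\\|{2,}','|'),
        '^\\||\\|$',''),
        '\\|')
      ) e as pos, val
```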

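Also, multiple LATERAL VIEW posexplode clauses can produce a Cartesian product of the exploded values for each row. See the answer on how to explode multiple arrays of different lengths by position.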
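
As a rough illustration of the position-matching idea, here is a minimal sketch on made-up data. The table contents and the names cofe_pos, cofe_val, fnte_pos and fnte_val are assumptions for the example, not taken from the question. Each column is exploded together with its position, and only combinations taken from the same position are kept, so each id yields one row per position instead of a cross product.

```sql
--hypothetical sample data: two pipe-delimited columns per id
with mytable as (
select stack (2,
'AAA','1|2','a|b',
'BBB','3|4','c|d'
) as (id, cofe, fnte)
)

select t.id, c.cofe_val as cofe, f.fnte_val as fnte
  from mytable t
       lateral view posexplode(split(t.cofe,'\\|')) c as cofe_pos, cofe_val
       lateral view posexplode(split(t.fnte,'\\|')) f as fnte_pos, fnte_val
 where c.cofe_pos = f.fnte_pos  --keep only pairs taken from the same position
```

Without the where clause, each id here would produce four rows (2 x 2); with it, two. This simple filter only keeps positions that exist in both arrays, so arrays of different lengths need additional handling.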