Hive SQL:如何按最长时间对记录组进行子集化

Hive SQL: How to subset groups of records by max time

我有按 ID1 和 ID2 分组的记录,如下所示:

ID1      ID2         date_time
Apple pear         2020-03-09T12:11:25:622Z
Apple pear         2020-03-09T12:23:36:825Z
Apple lemon        2020-03-08T08:01:16:030Z
Apple lemon        2020-03-09T10:11:12:930Z
Apple lemon        2020-03-09T15:13:02:610Z
Lime  peach        2020-03-09T07:34:06:825Z
Lime peach         2020-03-09T07:54:12:220Z
Melon banana       2020-03-09T03:54:11:041Z
Melon banana       2020-03-09T09:22:10:220Z
Orange pear        2020-03-09T11:13:36:217Z
Orange pear        2020-03-09T11:23:26:040Z
Orange pear        2020-03-09T11:43:35:721Z

我正在尝试提取最大记录超过时间范围的记录,这样如果我想提取最大时间不超过 3 月 9 日中午的记录,上述记录将被子集化为:

ID1      ID2         date_time
Lime  peach        2020-03-09T07:34:06:825Z
Lime peach         2020-03-09T07:54:12:220Z
Melon banana       2020-03-09T03:54:11:041Z
Melon banana       2020-03-09T09:22:10:220Z
Orange pear        2020-03-09T11:13:36:217Z
Orange pear        2020-03-09T11:23:26:040Z
Orange pear        2020-03-09T11:43:35:721Z

感谢@leftjoin,我使用cast(regexp_replace('2020-03-09T16:05:06:827Z','(.*?)T(.*?):([^:]*?)Z$',' \.') as timestamp)将date_time转换为时间戳。

感谢您的帮助。

按时间子设置后出现添加的顶点问题:

注意:我只允许临时table。

首先:我通过 ID1 和 ID2 对原始数据集进行了去重

 create temporary table data1 as 
select id1, id2, name, value,
       cast(regexp_replace(oldtime,'(.*?)T(.*?):([^:]*?)Z$',' \.') as timestamp) as date_time
       from (
        select ID1, ID2,
        ROW_NUMBER() OVER (PARTITION BY id1, id2,name, value ORDER BY value) as rn,
        case when name like '%_TIME' then value end as date_time
        from df.data1
         )undup

        where undup.rn = 1 AND 
        value <> '' and value is not null
        order by ID1, ID2, date_time ASC   

然后我按时间进行子集化:

create temporary table data2 as
    select *
     from(
     select *,
           max(date_time) over (partition by id1,id2) as max_date
      from data1
      ) s

      where max_date <= '2020-03-12 12:00:00' 

在此之后我得到一个顶点错误..

 ERROR: Execute error: Error while processing statement: FAILED: Execution Error, return code 2 from 
        org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 2, vertexId=vertex_1583806821840_6890_2_01, 
        diagnostics=[Task failed, taskId=task_1583806821840_6890_2_01_000036, diagnostics=[TaskAttempt 0 failed, info=[Error: Error 
        while running task ( failure ) : attempt_1583806821840_6890_2_01_000036_0:java.lang.RuntimeException: 
        java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at 
        org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296) at 
        org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at 
        org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at 
        org.apache.tez.runtime.task.TaskRunner2Callable.run(TaskRunner2Callable.java:73) at 
        org.apache.tez.runtime.task.TaskRunner2Callable.run(TaskRunner2Callable.java:61) at 
        java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at 
        org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at 
        org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at 
        org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at 
        org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at 
        com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenab
        leFutureTask.java:108) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41) at 
        com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77) at 
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at 
        java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException: 
        org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:304) at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:318) at 
        org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) ... 16 moreCaused by: 
        org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:378) at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:294) ... 18 moreCaused by: 
        java.lang.NullPointerException at 
        org.apache.hadoop.hive.ql.exec.persistence.PTFRowContainer.first(PTFRowContainer.java:115) at 
        org.apache.hadoop.hive.ql.exec.PTFPartition.iterator(PTFPartition.java:114) at 
        org.apache.hadoop.hive.ql.udf.ptf.BasePartitionEvaluator.getPartitionAgg(BasePartitionEvaluator.java:200) at 
        org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.evaluateFunctionOnPartition(WindowingTableFunction.java:155) at 
        org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.iterator(WindowingTableFunction.java:538) at 
        org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:349) at 
        org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:123) at 
        org.apache.hadoop.hive.ql.exec.Operator.baseForward(Operator.java:994) at 
        org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:940) at 
        org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:927) at 
        org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:363) ... 19 more], 
        TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) : 
        attempt_1583806821840_6890_2_01_000036_1:java.lang.RuntimeException: java.lang.RuntimeException: 
        org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at 
        org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296) at 
        org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at 
        org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at 
        org.apache.tez.runtime.task.TaskRunner2Callable.run(TaskRunner2Callable.java:73) at 
        org.apache.tez.runtime.task.TaskRunner2Callable.run(TaskRunner2Callable.java:61) at 
        java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at 
        org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at 
        org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at 
        org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at 
        org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at 
        com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenab
        leFutureTask.java:108) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41) at 
        com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77) at 
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at 
        java.lang.Thread.run(Thread.java:748)Caused by: java.lang.RuntimeException: 
        org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:304) at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:318) at 
        org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267) ... 16 moreCaused by: 
        org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:378) at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:294) ... 18 moreCaused by: 
        java.lang.NullPointerException at 
        org.apache.hadoop.hive.ql.exec.persistence.PTFRowContainer.first(PTFRowContainer.java:115) at 
        org.apache.hadoop.hive.ql.exec.PTFPartition.iterator(PTFPartition.java:114) at 
        org.apache.hadoop.hive.ql.udf.ptf.BasePartitionEvaluator.getPartitionAgg(BasePartitionEvaluator.java:200) at 
        org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.evaluateFunctionOnPartition(WindowingTableFunction.java:155) at 
        org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.iterator(WindowingTableFunction.java:538) at 
        org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:349) at 
        org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:123) at 
        org.apache.hadoop.hive.ql.exec.Operator.baseForward(Operator.java:994) at 
        org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:940) at 
        org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:927) at 
        org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) at 
        org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:363) ... 19 more], 
        TaskAttempt 2 failed, info=[Error: Error while running task ( failure ) : 
        attempt_1583806821840_6890_2_01_000036_2:java.lang.RuntimeException: java.lang.RuntimeException: 
        org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row at 
        org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296) at 
        org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250) at 
        org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374) at 
        org.apache.tez.runtime.task.TaskRunner2Callable.run(TaskRunner2Callable

使用最大分析函数并用它过滤:

with your_data as (
select stack (12,
'Apple', 'pear'      ,  '2020-03-09T12:11:25:622Z',
'Apple', 'pear'      ,  '2020-03-09T12:23:36:825Z',
'Apple', 'lemon'     ,  '2020-03-08T08:01:16:030Z',
'Apple', 'lemon'     ,  '2020-03-09T10:11:12:930Z',
'Apple', 'lemon'     ,  '2020-03-09T15:13:02:610Z',
'Lime',  'peach'     ,  '2020-03-09T07:34:06:825Z',
'Lime', 'peach'      ,  '2020-03-09T07:54:12:220Z',
'Melon', 'banana'    ,  '2020-03-09T03:54:11:041Z',
'Melon', 'banana'    ,  '2020-03-09T09:22:10:220Z',
'Orange','pear'      , '2020-03-09T11:13:36:217Z',
'Orange','pear'      , '2020-03-09T11:23:26:040Z',
'Orange','pear'      , '2020-03-09T11:43:35:721Z'
) as (ID1,ID2,date_time)
)
select ID1,ID2,date_time
from
(
select max(timestamp(regexp_replace(date_time,'(.*?)T(.*?):([^:]*?)Z$',' \.'))) over (partition by ID1,ID2) as max_date,
       ID1,ID2,timestamp(regexp_replace(date_time,'(.*?)T(.*?):([^:]*?)Z$',' \.')) as date_time
  from your_data 
)s 
where max_date<='2020-03-09 12:00:00' 

结果:

id1     id2     date_time
Lime    peach   2020-03-09 07:54:12.22
Lime    peach   2020-03-09 07:34:06.825
Melon   banana  2020-03-09 09:22:10.22
Melon   banana  2020-03-09 03:54:11.041
Orange  pear    2020-03-09 11:43:35.721
Orange  pear    2020-03-09 11:23:26.04
Orange  pear    2020-03-09 11:13:36.217