Hive 1.2 SQL returns unexpected special characters

Question

Running the following Hive query returns special characters:

SELECT t6.amt amt2,t6.color color
FROM(
 SELECT t5.color color, t5.c1 amt
 FROM(
  SELECT t1.c1 c1, t1.c2 AS color 
  from(
   SELECT  7716 AS c1, "Red" AS c2 UNION 
   SELECT  6203 AS c1, "Blue" AS c2
  ) t1
 ) t5
order by color) t6
ORDER BY color

It returns this result:

amt color
4   �
3   �

Is this a known Hive bug?

Explain plan

    Map 5 <- Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 4
         File Output Operator [FS_331359]
            compressed:false
            Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
            table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Select Operator [SEL_331358]
            |  outputColumnNames:["_col0","_col1"]
            |  Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
            |<-Reducer 3 [SIMPLE_EDGE]
               Reduce Output Operator [RS_331357]
                  key expressions:_col1 (type: int)
                  sort order:+
                  Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions:_col0 (type: string)
                  Select Operator [SEL_331351]
                     outputColumnNames:["_col0","_col1"]
                     Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                     Group By Operator [GBY_331350]
                     |  keys:KEY._col0 (type: int), KEY._col1 (type: string)
                     |  outputColumnNames:["_col0","_col1"]
                     |  Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                     |<-Union 2 [SIMPLE_EDGE]
                        |<-Map 1 [CONTAINS]
                        |  Reduce Output Operator [RS_331349]
                        |     key expressions:_col0 (type: int), _col1 (type: string)
                        |     Map-reduce partition columns:_col0 (type: int), _col1 (type: string)
                        |     sort order:++
                        |     Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                        |     Group By Operator [GBY_331348]
                        |        keys:_col0 (type: int), _col1 (type: string)
                        |        outputColumnNames:["_col0","_col1"]
                        |        Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                        |        Select Operator [SEL_331342]
                        |           outputColumnNames:["_col0","_col1"]
                        |           Statistics:Num rows: 1 Data size: 91 Basic stats: COMPLETE Column stats: COMPLETE
                        |           TableScan [TS_331341]
                        |              alias:_dummy_table
                        |              Statistics:Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: COMPLETE
                        |<-Map 5 [CONTAINS]
                           Reduce Output Operator [RS_331349]
                              key expressions:_col0 (type: int), _col1 (type: string)
                              Map-reduce partition columns:_col0 (type: int), _col1 (type: string)
                              sort order:++
                              Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                              Group By Operator [GBY_331348]
                                 keys:_col0 (type: int), _col1 (type: string)
                                 outputColumnNames:["_col0","_col1"]
                                 Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                                 Select Operator [SEL_331344]
                                    outputColumnNames:["_col0","_col1"]
                                    Statistics:Num rows: 1 Data size: 92 Basic stats: COMPLETE Column stats: COMPLETE
                                    TableScan [TS_331343]
                                       alias:_dummy_table
                                       Statistics:Num rows: 1 Data size: 1 Basic stats: COMPLETE Column stats: COMPLETE

Would disabling or enabling a configuration parameter help?

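If I reverse the order of the columns in the outermost select, the query returns the expected result. I would have expected the result to be

color value

Blue 6203

Red 7716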
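
For concreteness, the reordered variant described above might look like the following. This is only a sketch of that description: the inner subqueries are unchanged, and only the outermost select list is swapped so that color comes first.

```sql
-- Same query as in the question, with the outermost select list reversed
-- (color first, then amt). Nothing else is changed.
SELECT t6.color color, t6.amt amt2
FROM (
  SELECT t5.color color, t5.c1 amt
  FROM (
    SELECT t1.c1 c1, t1.c2 AS color
    FROM (
      SELECT 7716 AS c1, "Red" AS c2 UNION
      SELECT 6203 AS c1, "Blue" AS c2
    ) t1
  ) t5
  ORDER BY color
) t6
ORDER BY color
```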

Answer 1

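I tried the same query on my Hive 2.3 with both MR and Tez, and the results were the same as yours. I turned off all query optimizations, statistics gathering and rcp, but the result stayed the same. The problem is that Hive performs order by on a single reducer, and because you have two sequential order by clauses, Hive merges them into a single reduce stage (this is easy to see if you look at the explain extended or explain formatted query plan). More precisely, Hive uses _col0, _col1 and so on as column aliases. In the t5 subquery your key is _col0, but in t6 it is _col1. That is why in the select operator you see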

expressions:: "_col1 (type: string), _col0 (type: int)"

and in the reduce output operator

key expressions:: "_col1 (type: int)"

So Hive somehow switches the key type when you swap the select columns. If the order of the types is the same in t5 and t6, there is no problem:

key expressions:: "_col0 (type: string)"
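
To look at the merged reduce stage yourself, the query can be prefixed with EXPLAIN EXTENDED (EXPLAIN FORMATTED prints the plan as JSON instead). A minimal sketch using the query from the question follows; the operator IDs and statistics in your output will differ.

```sql
-- Print the detailed plan instead of running the query.
EXPLAIN EXTENDED
SELECT t6.amt amt2, t6.color color
FROM (
  SELECT t5.color color, t5.c1 amt
  FROM (
    SELECT t1.c1 c1, t1.c2 AS color
    FROM (
      SELECT 7716 AS c1, "Red" AS c2 UNION
      SELECT 6203 AS c1, "Blue" AS c2
    ) t1
  ) t5
  ORDER BY color
) t6
ORDER BY color;
```

In the output, compare the type shown under key expressions in the Reduce Output Operator with the types listed under expressions in the Select Operator. A string column reported as int is the mismatch described here.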

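As for how to avoid this, I really don't know: performing the sequential order by clauses in a single reducer is not the result of any extra optimization.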
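
One rewrite that may be worth trying is to drop the inner order by. This is an assumption, not a confirmed fix, and it has not been verified against Hive 1.2 or 2.3. The reasoning is that the outer ORDER BY on the same column already determines the final ordering, so the inner one is redundant, and with only one ORDER BY left there are no two sequential sorts for Hive to merge.

```sql
-- Untested sketch: the inner ORDER BY is removed, the outer one is kept.
-- Assumption: the inner sort is not needed for the final result.
SELECT t6.amt amt2, t6.color color
FROM (
  SELECT t5.color color, t5.c1 amt
  FROM (
    SELECT t1.c1 c1, t1.c2 AS color
    FROM (
      SELECT 7716 AS c1, "Red" AS c2 UNION
      SELECT 6203 AS c1, "Blue" AS c2
    ) t1
  ) t5
) t6
ORDER BY color;
```

If the inner sort has to stay, the other option is the one already reported to work in the question: keep the column order of the outermost select the same as in t5.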

