Left Join in Hive Produces Peculiar Results

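I have three tables that I want to join one after another on some common columns. The final result always contains a lot of null values after the second left join, even though the matching data exists in both tables.

Here is the code I ran: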

SELECT T1.INNERCODE                     AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       T1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
         LEFT JOIN (SELECT CALENDAR_DATE           AS CALENDAR_DATE,
                           CLOSEST_DATETIME_BEFORE AS DATESTAMP
                    FROM APP_RQA.T_CALENDAR_SS
                    WHERE EXCHMARKET = '83') T2
                   ON T1.REPORTDATE = T2.CALENDAR_DATE
         LEFT JOIN (SELECT CODE,
                           DATESTAMP,
                           UNIT_NAV,
                           VALUE,
                           RETURNRATE
                    FROM APP_RQA.T_FUND_NAV_SS) T3
                   ON T1.FUNDINNERCODE = T3.CODE
                       AND T2.DATESTAMP = T3.DATESTAMP

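Below is the final result. As you can see, the columns unit_nav, returnrate and value, which come from the last table, are mostly null.

For example, for fundinnercode = 4082 and datestamp = 2010-06-30 (highlighted in the picture), unit_nav, returnrate and value are all null. But this row does exist in table t_fund_nav_ss, as shown below:

Even stranger, when I add a WHERE clause that targets this specific row of the final result, the three missing columns do contain data: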

SELECT T1.INNERCODE                     AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       T1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
         LEFT JOIN (SELECT CALENDAR_DATE           AS CALENDAR_DATE,
                           CLOSEST_DATETIME_BEFORE AS DATESTAMP
                    FROM APP_RQA.T_CALENDAR_SS
                    WHERE EXCHMARKET = '83') T2
                   ON T1.REPORTDATE = T2.CALENDAR_DATE
         LEFT JOIN (SELECT CODE,
                           DATESTAMP,
                           UNIT_NAV,
                           VALUE,
                           RETURNRATE
                    FROM APP_RQA.T_FUND_NAV_SS) T3
                   ON T1.FUNDINNERCODE = T3.CODE
                       AND T2.DATESTAMP = T3.DATESTAMP
WHERE T1.FUNDINNERCODE = 4082
  AND T2.DATESTAMP = '2010-06-30'

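I can't make sense of this. Any help or suggestions would be much appreciated.

Thanks everyone for your comments. I happened to meet a big data engineer today and asked him about this problem. For those who are curious, it turned out the problem was related to the size of table T3. T1 and T2 have about 20-30k rows each, while T3 has 30 million rows, and the left join operation was losing data. So I first filtered T3 down to about 2 million rows, and the results now look normal.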
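For illustration, here is a rough sketch of what pre-filtering T3 before the join might look like. The exact filter used is not stated above, so the condition here (keeping only the fund codes that actually occur in T1, via Hive's LEFT SEMI JOIN) and the aliases N and F are assumptions; any other filter that shrinks T3 before the join would follow the same pattern.

```sql
SELECT T1.INNERCODE AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       T1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
         LEFT JOIN (SELECT CALENDAR_DATE           AS CALENDAR_DATE,
                           CLOSEST_DATETIME_BEFORE AS DATESTAMP
                    FROM APP_RQA.T_CALENDAR_SS
                    WHERE EXCHMARKET = '83') T2
                   ON T1.REPORTDATE = T2.CALENDAR_DATE
         -- Assumed pre-filter: shrink T3 to fund codes present in T1
         -- before it takes part in the outer join.
         LEFT JOIN (SELECT N.CODE,
                           N.DATESTAMP,
                           N.UNIT_NAV,
                           N.VALUE,
                           N.RETURNRATE
                    FROM APP_RQA.T_FUND_NAV_SS N
                             LEFT SEMI JOIN (SELECT DISTINCT FUNDINNERCODE
                                             FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL) F
                                            ON N.CODE = F.FUNDINNERCODE) T3
                   ON T1.FUNDINNERCODE = T3.CODE
                       AND T2.DATESTAMP = T3.DATESTAMP
```

Because the semi join only drops T3 rows whose CODE never appears in T1, it should not remove any row the outer join could have matched. It only reduces the amount of data that reaches the join.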