Left Join in Hive Produces Peculiar Results
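I have three tables that I want to join one after another on some common columns. The final result always contains a lot of nulls after the left join between two of the tables, even though the matching data exists in the second table.
This is the code I ran: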
SELECT T1.INNERCODE AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       T1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE AS CALENDAR_DATE,
                  CLOSEST_DATETIME_BEFORE AS DATESTAMP
           FROM APP_RQA.T_CALENDAR_SS
           WHERE EXCHMARKET = '83') T2
  ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN (SELECT CODE,
                  DATESTAMP,
                  UNIT_NAV,
                  VALUE,
                  RETURNRATE
           FROM APP_RQA.T_FUND_NAV_SS) T3
  ON T1.FUNDINNERCODE = T3.CODE
 AND T2.DATESTAMP = T3.DATESTAMP
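Below is the final result. As you can see, the columns unit_nav, returnrate and value from the last table are mostly null.
For example, for fundinnercode = 4082 and datestamp = 2010-06-30 (highlighted in the screenshot), unit_nav, returnrate and value are all null. Yet this row does exist in table t_fund_nav_ss, as shown below:
Even stranger, when I add a WHERE clause to target that specific row in the final result, the three missing columns do contain data: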
SELECT T1.INNERCODE AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       T1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE AS CALENDAR_DATE,
                  CLOSEST_DATETIME_BEFORE AS DATESTAMP
           FROM APP_RQA.T_CALENDAR_SS
           WHERE EXCHMARKET = '83') T2
  ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN (SELECT CODE,
                  DATESTAMP,
                  UNIT_NAV,
                  VALUE,
                  RETURNRATE
           FROM APP_RQA.T_FUND_NAV_SS) T3
  ON T1.FUNDINNERCODE = T3.CODE
 AND T2.DATESTAMP = T3.DATESTAMP
WHERE T1.FUNDINNERCODE = 4082
  AND T2.DATESTAMP = '2010-06-30'
I can't make sense of this. Any help or suggestions would be much appreciated.
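One way to narrow down a problem like this is to measure how many joined rows actually lack a T3 match, rather than eyeballing the output. The query below is only a sketch of that idea: it reuses the tables and join keys from the question, the aliases total_rows and rows_without_nav are made up for illustration, and it assumes CODE is never null in T_FUND_NAV_SS, so that a null T3.CODE means "no match".

```sql
-- Diagnostic sketch: count joined rows with and without a match in T3.
-- Assumes T3.CODE is never NULL in the source table (not stated in the question).
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN T3.CODE IS NULL THEN 1 ELSE 0 END) AS rows_without_nav
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE,
                  CLOSEST_DATETIME_BEFORE AS DATESTAMP
           FROM APP_RQA.T_CALENDAR_SS
           WHERE EXCHMARKET = '83') T2
  ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN APP_RQA.T_FUND_NAV_SS T3
  ON T1.FUNDINNERCODE = T3.CODE
 AND T2.DATESTAMP = T3.DATESTAMP;
```

If rows_without_nav changes when the same count is run with a WHERE clause restricting T1 to a few funds, that suggests the join itself is behaving inconsistently rather than the data being absent.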
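Thanks for all the comments. I happened to run into a big data engineer today and asked him about this. For those who are curious, it turns out the problem was related to the size of table T3. T1 and T2 have roughly 20-30k rows each, while T3 has 30 million rows, and the left join operation was losing data. So I first filtered T3 down to about 2 million rows, and the results now look normal.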
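The post does not say which filter was used to shrink T3, so the following is only one possible sketch of "filter T3 first, then join". It assumes that keeping only the NAV rows whose CODE appears in T1 is an acceptable reduction; the aliases N and K are hypothetical, and whether this gets T3 anywhere near 2 million rows depends on the data.

```sql
-- Sketch: pre-filter T3 inside the subquery before the left join.
-- The LEFT SEMI JOIN filter is an assumption, not the author's actual filter.
SELECT T1.INNERCODE AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       T1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE,
                  CLOSEST_DATETIME_BEFORE AS DATESTAMP
           FROM APP_RQA.T_CALENDAR_SS
           WHERE EXCHMARKET = '83') T2
  ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN (SELECT N.CODE,
                  N.DATESTAMP,
                  N.UNIT_NAV,
                  N.VALUE,
                  N.RETURNRATE
           FROM APP_RQA.T_FUND_NAV_SS N
           -- keep only NAV rows for funds that occur in T1
           LEFT SEMI JOIN (SELECT DISTINCT FUNDINNERCODE
                           FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL) K
             ON N.CODE = K.FUNDINNERCODE) T3
  ON T1.FUNDINNERCODE = T3.CODE
 AND T2.DATESTAMP = T3.DATESTAMP;
```

Because the filter only discards T3 rows that could never match a T1 row, it should not change which rows of a correct left join find a match. A date-range condition on N.DATESTAMP would be another way to cut T3 down, if the reporting period is known.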