Why am I not getting all columns from this Union?
I'm trying to consolidate two different notebooks into one by merging the construction logic of their two different tables.
The first one is:
spark.sql(''' SELECT CD_CLI,
MAX(VL_RPTD_UTZO) AS MAX_VL_RPTD_UTZO,
'2017-01-31' AS DT_MVTC
FROM vl_rptd_utzo
WHERE DT_EXTC BETWEEN '2016-07-31' AND '2016-12-31'
GROUP BY CD_CLI
''').createOrReplaceTempView('vl_rptd_max_utzo_2017_01_31')
The second:
spark.sql('''SELECT CD_CLI,
CASE WHEN SUM(in_lim_crt) > 0
THEN ROUND(SUM(SUM_VL_TTL_FAT)/SUM(in_lim_crt), 4)
ELSE -99999999999
END AS VL_MED_FAT,
'2017-01-31' as DT_MVTC
FROM in_lim_fat
WHERE DT_MVTC BETWEEN '2016-07-31' AND '2016-12-31'
GROUP BY CD_CLI
''').createOrReplaceTempView('media_vl_fatura_2017_01_31')
My (perhaps naive?) approach was to union the two selects, since they pull the same kind of fields from similar sources:
spark.sql('''SELECT CD_CLI,
CASE WHEN SUM(in_lim_crt) > 0
THEN ROUND(SUM(SUM_VL_TTL_FAT)/SUM(in_lim_crt), 4)
ELSE -99999999999
END AS VL_MED_FAT,
'2017-01-31' as DT_MVTC
FROM in_lim_fat
WHERE DT_MVTC BETWEEN '2016-07-31' AND '2016-12-31'
GROUP BY CD_CLI
UNION
SELECT CD_CLI,
MAX(VL_RPTD_UTZO) AS MAX_VL_RPTD_UTZO,
'2017-01-31' AS DT_MVTC
FROM vl_rptd_utzo
WHERE DT_EXTC BETWEEN '2016-07-31' AND '2016-12-31'
GROUP BY CD_CLI
''').createOrReplaceTempView('new_table')
But when I ask for a describe:
spark.sql('describe new_table').show(10, False)
the output is:
+----------+-------------+-------+
|col_name |data_type |comment|
+----------+-------------+-------+
|CD_CLI |int |null |
|VL_MED_FAT|decimal(38,4)|null |
|DT_MVTC |string |null |
+----------+-------------+-------+
Why doesn't MAX_VL_RPTD_UTZO appear in the new table? I'm new to SQL, so maybe this is naive and simple, but I can't work it out.
Your first select has CD_CLI, VL_MED_FAT, and DT_MVTC.
Your second select has CD_CLI, MAX_VL_RPTD_UTZO, and DT_MVTC.
Spark takes the column names of the first query as the schema and applies it to every subsequent query in the union, so the MAX_VL_RPTD_UTZO values end up in the VL_MED_FAT column.
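You don't need Spark to see this positional matching; standard SQL UNION behaves the same way, so a stdlib `sqlite3` sketch (toy schemas and values made up here, table names borrowed from the question) illustrates it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE in_lim_fat   (CD_CLI INT, VL_MED_FAT REAL);
CREATE TABLE vl_rptd_utzo (CD_CLI INT, MAX_VL_RPTD_UTZO REAL);
INSERT INTO in_lim_fat   VALUES (1, 0.25);
INSERT INTO vl_rptd_utzo VALUES (2, 900.0);
""")

cur = conn.execute("""
SELECT CD_CLI, VL_MED_FAT FROM in_lim_fat
UNION
SELECT CD_CLI, MAX_VL_RPTD_UTZO FROM vl_rptd_utzo
""")

# Column names come from the FIRST select only.
names = [d[0] for d in cur.description]
rows = cur.fetchall()
print(names)  # ['CD_CLI', 'VL_MED_FAT']
print(rows)   # the second table's 900.0 lands under VL_MED_FAT
```

Columns are matched by position, never by name, which is exactly why the result schema only shows CD_CLI, VL_MED_FAT, and DT_MVTC.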
Edit #1: if you want 4 columns, then the column lists must be consistent between the 2 queries, like this:
select CD_CLI, VL_MED_FAT, null as MAX_VL_RPTD_UTZO, DT_MVTC from ...
union
select CD_CLI, null as VL_MED_FAT, MAX_VL_RPTD_UTZO, DT_MVTC from ...
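Applied to the question's tables, the padded union might look like the following `sqlite3` sketch (again with a made-up toy schema and values; in Spark you would run the same SQL through `spark.sql`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE media_vl_fatura  (CD_CLI INT, VL_MED_FAT REAL, DT_MVTC TEXT);
CREATE TABLE vl_rptd_max_utzo (CD_CLI INT, MAX_VL_RPTD_UTZO REAL, DT_MVTC TEXT);
INSERT INTO media_vl_fatura  VALUES (1, 0.25,  '2017-01-31');
INSERT INTO vl_rptd_max_utzo VALUES (1, 900.0, '2017-01-31');
""")

cur = conn.execute("""
SELECT CD_CLI, VL_MED_FAT, NULL AS MAX_VL_RPTD_UTZO, DT_MVTC FROM media_vl_fatura
UNION
SELECT CD_CLI, NULL AS VL_MED_FAT, MAX_VL_RPTD_UTZO, DT_MVTC FROM vl_rptd_max_utzo
""")

# Both branches now list the same 4 columns in the same order,
# so each measure keeps its own column, padded with NULLs.
names = [d[0] for d in cur.description]
rows = cur.fetchall()
print(names)  # ['CD_CLI', 'VL_MED_FAT', 'MAX_VL_RPTD_UTZO', 'DT_MVTC']
```

Note this yields two rows per CD_CLI, each with one NULL measure; if the goal is one row per client with both measures filled in, a join on CD_CLI and DT_MVTC (rather than a union) may be what's actually wanted.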