Output semicolon-separated values in a field in Databricks SQL
Desired result:
+---------+-----------------------------+
| ID PR | Related Repeating Event(s) |
+---------+-----------------------------+
| 1658503 | 1615764;1639329 |
+---------+-----------------------------+
Is there a way to write this query in SQL/Databricks without using a user-defined aggregate function (UDAF)? I have tried concat(), GROUP_CONCAT(), and LISTAGG, but none of them work or they are not supported in Databricks ("This function is neither a registered temporary function nor a permanent function registered in the database 'default'...").
I found this description of user-defined aggregate functions (UDAFs) in the Databricks documentation, but I don't know how to implement one (https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-functions-udf-aggregate.html#user-defined-aggregate-functions-udafs&language-sql).
Can someone give me a hint or a link?
I have this basic query:
%sql
SELECT
pr_id,
data_field_nm,
field_value
FROM
gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl --(18)
WHERE
pr_id = 1658503
AND data_field_nm = 'Related Repeating Deviation(s)'
The output is:
+---------+--------------------------------+-------------+
| pr_id | data_field_nm | field_value |
+---------+--------------------------------+-------------+
| 1658503 | Related Repeating Deviation(s) | 1615764 |
| 1658503 | Related Repeating Deviation(s) | 1639329 |
+---------+--------------------------------+-------------+
The accepted answer (thanks to @Alex Ott) is:
%sql
SELECT
pr_id AS IDPR,
concat_ws(';', collect_list(field_value)) AS RelatedRepeatingDeviations
FROM
gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl
WHERE
data_field_nm = 'Related Repeating Deviation(s)'
AND pr_id = 1658503
GROUP BY
pr_id,
data_field_nm;
which gives the desired result:
+---------+-----------------------------+
| IDPR | RelatedRepeatingDeviations |
+---------+-----------------------------+
| 1658503 | 1615764;1639329 |
+---------+-----------------------------+
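Note that collect_list does not guarantee any particular ordering of the collected values. If the order inside the semicolon-separated string matters, a minimal variant of the same query (a sketch against the same table and filters as above, assuming Spark SQL 2.4+) can sort the array first and join it with array_join:

%sql
SELECT
  pr_id AS IDPR,
  -- sort_array orders the collected values ascending; array_join concatenates them with ';'
  array_join(sort_array(collect_list(field_value)), ';') AS RelatedRepeatingDeviations
FROM
  gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl
WHERE
  data_field_nm = 'Related Repeating Deviation(s)'
  AND pr_id = 1658503
GROUP BY
  pr_id,
  data_field_nm;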
Just use group by together with collect_list and concat_ws, like this:
- get the data:
from pyspark.sql import Row
df = spark.createDataFrame([Row(**{'pr_id':1658503, 'data_field_nm':'related', 'field_value':1615764}),
Row(**{'pr_id':1658503, 'data_field_nm':'related', 'field_value':1639329})])
df.createOrReplaceTempView("abc")
- and run the query:
%sql
select pr_id,
data_field_nm,
concat_ws(';', collect_list(field_value)) as combined
from abc
group by pr_id, data_field_nm
although this gives you a column with a fixed name (combined).
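If you also want the column headers to match the desired output exactly (e.g. Related Repeating Event(s) instead of combined), you can alias the columns with backtick-quoted identifiers, which may contain spaces and parentheses in Spark SQL. This is a sketch against the temporary view abc created above, not part of the original answer:

%sql
select pr_id as `ID PR`,
       -- backtick-quoted aliases allow spaces and special characters in the result header
       concat_ws(';', collect_list(field_value)) as `Related Repeating Event(s)`
from abc
group by pr_id, data_field_nm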