Need to get the max value for each specified column after converting table headers into a column
I need a pointer/clue to solve the problem below.
Problem statement: I need to convert all table headers into a column (col_name) and get the max value of each of those columns. I am trying the logic below but am stuck; any suggestion/idea would be very helpful.
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import col, lit, max
df = sc.parallelize([
    Row(name='Alice', age=5, height=80),
    Row(name='Mujkesh', age=10, height=90),
    Row(name='Ganesh', age=15, height=100)]).toDF().createOrReplaceTempView("Test")
df3 = spark.sql("Describe Test")
df4 = df3.withColumn("Max_val", max(col(age))).show()
Given input:
+---+------+-------+
|age|height| name|
+---+------+-------+
| 5| 80| Alice|
| 10| 90|Mujkesh|
| 15| 100| Ganesh|
+---+------+-------+
Expected output:
+--------+---------+-------+-------+
|col_name|data_type|comment|Max_val|
+--------+---------+-------+-------+
| age| bigint| null| 15|
| height| bigint| null| 100|
| name| string| null| null|
+--------+---------+-------+-------+
Try using the stack function, then group by col_name to get the max of each group, and then join the result with the desc dataframe.
Example:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import *

df = sc.parallelize([
    Row(name='Alice', age=5, height=80),
    Row(name='Mujkesh', age=10, height=90),
    Row(name='Ganesh', age=15, height=100)]).toDF()
df.createOrReplaceTempView("Test")

# desc returns one row per column: col_name, data_type, comment
df3 = spark.sql("desc Test")

# unpivot the columns into (col_name, data) pairs; name is cast to bigint so
# every data value has the same type (it becomes null), then take the max per col_name
df4 = (df.selectExpr("stack(3, 'name', bigint(name), 'age', age, 'height', height) as (col_name, data)")
         .groupBy(col("col_name"))
         .agg(max(col("data")).alias("Max_val")))

# join the per-column max back onto the describe output
df5 = df3.join(df4, ['col_name'], 'inner').orderBy("col_name")
df5.show()
#+--------+---------+-------+-------+
#|col_name|data_type|comment|Max_val|
#+--------+---------+-------+-------+
#| age| bigint| null| 15|
#| height| bigint| null| 100|
#| name| string| null| null|
#+--------+---------+-------+-------+
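If the table has more than a handful of columns, the same idea works without hardcoding the stack expression: build it from df.columns. A minimal sketch, assuming df and df3 are the dataframes created above and that casting every column to bigint is acceptable for the comparison (string columns such as name simply yield null, as in the example):

from pyspark.sql.functions import max as max_

# build "stack(n, 'c1', bigint(c1), 'c2', bigint(c2), ...)" from the column list
cols = df.columns
stack_expr = "stack({}, {}) as (col_name, data)".format(
    len(cols),
    ", ".join("'{0}', bigint({0})".format(c) for c in cols))

df4 = (df.selectExpr(stack_expr)
         .groupBy("col_name")
         .agg(max_("data").alias("Max_val")))

df5 = df3.join(df4, ['col_name'], 'inner').orderBy("col_name")
df5.show()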