Ranking date column in pyspark
I have the following dataframe in pyspark:
>>> df.show()
+----------+------+
| date_col|counts|
+----------+------+
|2022-02-05|350647|
|2022-02-06|313091|
+----------+------+
I want to create a resulting dataframe that ranks date_col, assigning rank 1 to the most recent date:
>>> df.show()
+----------+------+---------+
| date_col|counts|order_col|
+----------+------+---------+
|2022-02-05|350647| 2|
|2022-02-06|313091| 1|
+----------+------+---------+
How can we achieve this?
The following script can be used to create the dataframe df:
from datetime import date
from pyspark.sql import Row
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Note: month/day literals must not be zero-padded (02, 05 are
# syntax errors in Python 3), and the column is named counts.
df = spark.createDataFrame([
    Row(date_col=date(2022, 2, 5), counts=350647),
    Row(date_col=date(2022, 2, 6), counts=313091),
])
df.show()
You can easily do this using rank in conjunction with Window.
Data preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql import Window
import pandas as pd  # needed for pd.DataFrame below

sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
d = {
'date_col':['2022-02-05', '2022-02-06', '2022-02-07', '2022-02-08'],
'counts':[350647, 313091, 317791, 312145],
}
sparkDF = sql.createDataFrame(pd.DataFrame(d))
sparkDF.show()
+----------+------+
| date_col|counts|
+----------+------+
|2022-02-05|350647|
|2022-02-06|313091|
|2022-02-07|317791|
|2022-02-08|312145|
+----------+------+
Ranking
window = Window.orderBy(F.col('date_col').desc())
sparkDF = sparkDF.withColumn('order_col', F.rank().over(window))
sparkDF.show()
+----------+------+---------+
| date_col|counts|order_col|
+----------+------+---------+
|2022-02-08|312145| 1|
|2022-02-07|317791| 2|
|2022-02-06|313091| 3|
|2022-02-05|350647| 4|
+----------+------+---------+
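As an aside, rank() follows SQL's RANK() semantics: tied values share a rank and the next rank skips ahead, leaving a gap. A plain-Python sketch of that behavior, independent of Spark (the helper name sql_rank and the tie in the sample data are illustrative, not from the original):

```python
def sql_rank(values, reverse=False):
    """Mimic SQL RANK(): ties share a rank, and the next rank leaves a gap."""
    ordered = sorted(values, reverse=reverse)
    # index of the first occurrence + 1 reproduces the gap behavior
    return {v: ordered.index(v) + 1 for v in set(values)}

counts = [350647, 313091, 313091, 312145]
ranks = sql_rank(counts, reverse=True)
print(ranks)  # 350647 -> 1, 313091 -> 2 (tie), 312145 -> 4 (no rank 3)
```

If you need consecutive ranks with no gaps, F.dense_rank() is the Spark counterpart.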
Ranking - Spark SQL
sparkDF.createOrReplaceTempView('TB1')  # register the view the query reads from

sql.sql(
"""
SELECT
date_col
,counts
,RANK() OVER( ORDER BY date_col DESC) as order_col
FROM TB1
"""
).show()
+----------+------+---------+
| date_col|counts|order_col|
+----------+------+---------+
|2022-02-08|312145| 1|
|2022-02-07|317791| 2|
|2022-02-06|313091| 3|
|2022-02-05|350647| 4|
+----------+------+---------+