Count unique column values given another column in PySpark
I am trying to count the Date values for each unique ID in PySpark.
+-------------------+----------+
| Date| ID|
+-------------------+----------+
|2022-03-19 00:00:00| Ax3838J|
|2022-03-11 00:00:00| Ax3838J|
|2021-11-01 00:00:00| Ax3838J|
|2021-10-27 00:00:00| Ax3838J|
|2021-10-25 00:00:00| Bz3838J|
|2021-10-22 00:00:00| Bz3838J|
|2021-10-18 00:00:00| Bz3838J|
|2021-10-15 00:00:00| Rr7422u|
|2021-09-22 00:00:00| Rr742uL|
+-------------------+----------+
When I try
df.groupBy('ID').count('Date').show()
I get the error:
_api() takes 1 positional argument but 2 were given
That makes sense, but I am not sure what other techniques PySpark offers for this kind of count.
How can I count the unique Date values, starting from this:
df.groupBy('ID').count().show()
Expected output:
+-----+----------+
|count|        ID|
+-----+----------+
|    4|   Ax3838J|
|    3|   Bz3838J|
|    2|   Rr742uL|
+-----+----------+
Try this:
from pyspark.sql.functions import countDistinct
df.groupBy('ID').agg(countDistinct('Date').alias('count')).show()
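If you prefer to avoid countDistinct, de-duplicating first and then doing a plain count should give the same result; a minimal sketch, assuming df is the Date/ID DataFrame from the question:
# Sketch: drop duplicate (ID, Date) pairs, then do a plain count per ID.
df.dropDuplicates(['ID', 'Date']).groupBy('ID').count().show()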
Please find below a working version that produces the expected output. I am running the code on Spark 3.
from pyspark.sql.functions import countDistinct
data = [["2022-03-19 00:00:00", "Ax3838J"], ["2022-03-11 00:00:00", "Ax3838J"], ["2021-11-01 00:00:00", "Ax3838J"], ["2021-10-27 00:00:00", "Ax3838J"], ["2021-10-25 00:00:00", "Bz3838J"], ["2021-10-22 00:00:00", "Bz3838J"], ["2021-10-18 00:00:00", "Bz3838J"], ["2021-10-15 00:00:00", "Rr742uL"], ["2021-09-22 00:00:00", "Rr742uL"]]
df = spark.createDataFrame(data, ['Date', 'ID'])
df.show()
+-------------------+-------+
| Date| ID|
+-------------------+-------+
|2022-03-19 00:00:00|Ax3838J|
|2022-03-11 00:00:00|Ax3838J|
|2021-11-01 00:00:00|Ax3838J|
|2021-10-27 00:00:00|Ax3838J|
|2021-10-25 00:00:00|Bz3838J|
|2021-10-22 00:00:00|Bz3838J|
|2021-10-18 00:00:00|Bz3838J|
|2021-10-15 00:00:00|Rr742uL|
|2021-09-22 00:00:00|Rr742uL|
+-------------------+-------+
df.groupby("ID").agg(countDistinct("Date").alias("count")).show()
+-------+-----+
| ID|count|
+-------+-----+
|Rr742uL| 2|
|Ax3838J| 4|
|Bz3838J| 3|
+-------+-----+
Let me know if you need any further help, and please accept the answer if it solves your problem.
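As a side note, if the data is very large and an exact figure is not required, PySpark also provides an approximate variant; a minimal sketch, assuming the same df as above:
from pyspark.sql.functions import approx_count_distinct
# Approximate distinct count per ID (HyperLogLog-based); trades exactness for speed on large data.
df.groupby("ID").agg(approx_count_distinct("Date").alias("approx_count")).show()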