How to Limit and Partition Data in a PySpark DataFrame

I have the following data:

+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
|restaurant_id|     restaurant_name|     city|state|postal_code|               stars|review_count|cuisine_name|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
|        62112|      Neptune Oyster|   Boston|   MA|      02113|4.500000000000000000|        5115|    American|
|        62112|      Neptune Oyster|   Boston|   MA|      02113|4.500000000000000000|        5115|        Thai|
|        60154|Giacomo's Ristora...|   Boston|   MA|      02113|4.000000000000000000|        3520|     Italian|
|        61455|Atlantic Fish Com...|   Boston|   MA|      02116|4.000000000000000000|        2575|    American|
|        57757|      Top of the Hub|   Boston|   MA|      02199|3.500000000000000000|        2273|    American|
|        58631|         Carmelina's|   Boston|   MA|      02113|4.500000000000000000|        2250|     Italian|
|        58895|         The Beehive|   Boston|   MA|      02116|3.500000000000000000|        2184|    American|
|        56517|Lolita Cocina & T...|   Boston|   MA|      02116|4.000000000000000000|        2179|    American|
|        56517|Lolita Cocina & T...|   Boston|   MA|      02116|4.000000000000000000|        2179|     Mexican|
|        58440|                Toro|   Boston|   MA|      02118|4.000000000000000000|        2175|     Spanish|
|        58615|     Regina Pizzeria|   Boston|   MA|      02113|4.000000000000000000|        2071|     Italian|
|        58723|            Gaslight|   Boston|   MA|      02118|4.000000000000000000|        2056|    American|
|        58723|            Gaslight|   Boston|   MA|      02118|4.000000000000000000|        2056|      French|
|        60920|  Modern Pastry Shop|   Boston|   MA|      02113|4.000000000000000000|        2042|     Italian|
|        59453|Gourmet Dumpling ...|   Boston|   MA|      02111|3.500000000000000000|        1990|   Taiwanese|
|        59453|Gourmet Dumpling ...|   Boston|   MA|      02111|3.500000000000000000|        1990|     Chinese|
|        59204|Russell House Tavern|Cambridge|   MA|      02138|4.000000000000000000|        1965|    American|
|        60732|Eastern Standard ...|   Boston|   MA|      02215|4.000000000000000000|        1890|    American|
|        60732|Eastern Standard ...|   Boston|   MA|      02215|4.000000000000000000|        1890|      French|
|        56970|         Border Café|Cambridge|   MA|      02138|4.000000000000000000|        1880|     Mexican|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+

I want to partition the data by city, state, and cuisine, order it by stars and review count, and finally limit the number of records in each partition.

Can this be done with PySpark?

You can add a row_number to each partition after windowing and filter on it to limit the records per window. The max_number_of_rows_per_partition variable in the code below controls the maximum number of rows per window.

Since your question did not specify how you want stars and review_count ordered, I have assumed descending order for both.
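For a self-contained test, df can be built from a few rows of the sample data above (a minimal sketch: the SparkSession setup and the plain-float stars column are assumptions, not part of the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A handful of rows from the table above, including four Boston/American
# rows so the per-window limit actually filters something.
df = spark.createDataFrame(
    [
        (62112, "Neptune Oyster", "Boston", "MA", "02113", 4.5, 5115, "American"),
        (57757, "Top of the Hub", "Boston", "MA", "02199", 3.5, 2273, "American"),
        (58895, "The Beehive", "Boston", "MA", "02116", 3.5, 2184, "American"),
        (58723, "Gaslight", "Boston", "MA", "02118", 4.0, 2056, "American"),
        (58631, "Carmelina's", "Boston", "MA", "02113", 4.5, 2250, "Italian"),
        (56970, "Border Café", "Cambridge", "MA", "02138", 4.0, 1880, "Mexican"),
    ],
    ["restaurant_id", "restaurant_name", "city", "state", "postal_code",
     "stars", "review_count", "cuisine_name"],
)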

import pyspark.sql.functions as F
from pyspark.sql import Window

# Partition by city, state and cuisine; within each partition,
# order by stars and review_count, both descending.
window_spec = Window.partitionBy("city", "state", "cuisine_name")\
                    .orderBy(F.col("stars").desc(), F.col("review_count").desc())

max_number_of_rows_per_partition = 3

# Number the rows inside each window, keep only the first
# max_number_of_rows_per_partition of them, then drop the helper column.
df.withColumn("row_number", F.row_number().over(window_spec))\
  .filter(F.col("row_number") <= max_number_of_rows_per_partition)\
  .drop("row_number")\
  .show(200, False)
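
One design note: row_number breaks ties on stars and review_count arbitrarily. If you would rather keep all tied rows, swapping F.row_number() for F.rank() is a possible variant (a sketch, not part of the original answer; a window can then return more than max_number_of_rows_per_partition rows when ties exist):

# Variant: F.rank() gives tied rows the same rank, so restaurants that
# tie on stars and review_count are all kept (possibly exceeding the limit).
df.withColumn("rank", F.rank().over(window_spec))\
  .filter(F.col("rank") <= max_number_of_rows_per_partition)\
  .drop("rank")\
  .show(200, False)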