
Pyspark: how to make a query that returns only IDs with entries greater than one?

I have a table that looks like this:

Timestamp,  Name,    Value  
1577862435, Tom,      0.25  
1577915618, Tom,      0.50  
1577839734, John,     0.34
1577839734, John,     0.34
1577839734, John,     0.34
1577839734, Eric,     0.34

To count the entries for each user, I do this:

query = """ SELECT Name,
            COUNT(*) AS `num`
            FROM
            myTable
            GROUP BY Name
            ORDER BY num DESC
"""
count = spark.sql(query)
count.show()

Name    num
John     3
Tom      2
Eric     1

I want a query that returns only the rows whose Name has num >= 2. My final table should be:

Timestamp,  Name,    Value  
1577862435, Tom,      0.25  
1577915618, Tom,      0.50  
1577839734, John,     0.34
1577839734, John,     0.34
1577839734, John,     0.34

Then you should use a window function.

from pyspark.sql import Window
import pyspark.sql.functions as F

df = spark.table("myTable")

# Count the rows in each Name partition with a window, keep rows
# whose count exceeds 1, then drop the helper column.
df.withColumn(
    "cnt",
    F.count("*").over(Window.partitionBy("Name"))
).where("cnt > 1").drop("cnt").show()
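The filtering logic behind the window approach can be sketched without Spark at all: count rows per Name, then keep only rows whose name appears more than once. A minimal plain-Python sketch, using the sample rows from the question:

```python
from collections import Counter

# Sample rows from the question: (Timestamp, Name, Value)
rows = [
    (1577862435, "Tom", 0.25),
    (1577915618, "Tom", 0.50),
    (1577839734, "John", 0.34),
    (1577839734, "John", 0.34),
    (1577839734, "John", 0.34),
    (1577839734, "Eric", 0.34),
]

# Equivalent of COUNT(*) OVER (PARTITION BY Name): rows per Name
counts = Counter(name for _, name, _ in rows)

# Equivalent of .where("cnt > 1"): drop names that appear only once
kept = [row for row in rows if counts[row[1]] > 1]
# Eric's single row is filtered out; Tom's 2 and John's 3 rows remain
```

This is only an illustration of the logic; in Spark the window count is computed per partition without collecting the data to the driver.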

You can express the same thing in SQL:

SELECT Timestamp, Name, Value
FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY Name) AS num
      FROM myTable t
     ) t
WHERE num >= 2;
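The same window-function SQL can be tried out locally without a Spark cluster; SQLite (3.25 or later, as bundled with recent Python versions) supports `COUNT(*) OVER (PARTITION BY ...)`. A sketch, assuming the sample data from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (Timestamp INTEGER, Name TEXT, Value REAL)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [
        (1577862435, "Tom", 0.25),
        (1577915618, "Tom", 0.50),
        (1577839734, "John", 0.34),
        (1577839734, "John", 0.34),
        (1577839734, "John", 0.34),
        (1577839734, "Eric", 0.34),
    ],
)

# Same shape as the Spark SQL answer: compute a per-Name window count
# in a subquery, then filter on it in the outer query.
result = conn.execute("""
    SELECT Timestamp, Name, Value
    FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY Name) AS num
          FROM myTable t
         ) t
    WHERE num >= 2
""").fetchall()
# Eric's lone row is excluded; Tom's and John's rows are returned
```

Note that `HAVING num >= 2` would not work here: `HAVING` filters groups after `GROUP BY`, while the window count keeps every original row, which is what the desired output requires.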