Pyspark: how to make a query that returns only IDs with entries greater than one?
I have a table that looks like this:
Timestamp, Name, Value
1577862435, Tom, 0.25
1577915618, Tom, 0.50
1577839734, John, 0.34
1577839734, John, 0.34
1577839734, John, 0.34
1577839734, Eric, 0.34
To count the entries for each user, I do this:
query = """ SELECT Name,
                   COUNT(*) AS `num`
            FROM
                   myTable
            GROUP BY Name
            ORDER BY num DESC
        """
count = spark.sql(query)
count.show()
Name num
John 3
Tom 2
Eric 1
I want the query to return only the IDs with num >= 2. My final table should be:
Timestamp, Name, Value
1577862435, Tom, 0.25
1577915618, Tom, 0.50
1577839734, John, 0.34
1577839734, John, 0.34
1577839734, John, 0.34
Then you should use a window function.
from pyspark.sql import Window
from pyspark.sql import functions as F  # needed for F.count

df = spark.table("myTable")
df.withColumn(
    "cnt",
    F.count("*").over(Window.partitionBy("Name"))
).where("cnt > 1").drop("cnt").show()
In SQL, you can write it like this:
SELECT Timestamp, Name, Value
FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY Name) AS num
      FROM myTable t
     ) t
WHERE num >= 2;
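To make the semantics concrete, here is the same count-and-filter logic that both answers implement, sketched in plain Python over an in-memory list of rows (this is an illustration of the logic only, not the Spark API):

```python
from collections import Counter

# Sample rows mirroring the question's table: (Timestamp, Name, Value)
rows = [
    (1577862435, "Tom", 0.25),
    (1577915618, "Tom", 0.50),
    (1577839734, "John", 0.34),
    (1577839734, "John", 0.34),
    (1577839734, "John", 0.34),
    (1577839734, "Eric", 0.34),
]

# Count entries per Name -- the GROUP BY / COUNT(*) OVER (PARTITION BY Name) step
counts = Counter(name for _, name, _ in rows)

# Keep only rows whose Name appears at least twice -- the WHERE num >= 2 step
filtered = [row for row in rows if counts[row[1]] >= 2]

for row in filtered:
    print(row)
```

Unlike a GROUP BY, this keeps every original row for the qualifying names, which is exactly why the answers use a window function rather than an aggregate query.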