Repeated values in pyspark

I have a dataframe in pyspark with three columns:

df1 = spark.createDataFrame([
    ('a', 3, 4.2),
    ('a', 7, 4.2),
    ('b', 7, 2.6),
    ('c', 7, 7.21),
    ('c', 11, 7.21),
    ('c', 18, 7.21),
    ('d', 15, 9.0),
], ['model', 'number', 'price'])
df1.show()
+-----+------+-----+
|model|number|price|
+-----+------+-----+
|    a|     3|  4.2|
|    a|     7|  4.2|
|    b|     7|  2.6|
|    c|     7| 7.21|
|    c|    11| 7.21|
|    c|    18| 7.21|
|    d|    15|  9.0|
+-----+------+-----+

Is there a way in pyspark to show only the rows whose value in the 'price' column is repeated?

Like df2:

df2 = spark.createDataFrame([
    ('a', 3, 4.2),
    ('a', 7, 4.2),
    ('c', 7, 7.21),
    ('c', 11, 7.21),
    ('c', 18, 7.21),
], ['model', 'number', 'price'])
df2.show()
+-----+------+-----+
|model|number|price|
+-----+------+-----+
|    a|     3|  4.2|
|    a|     7|  4.2|
|    c|     7| 7.21|
|    c|    11| 7.21|
|    c|    18| 7.21|
+-----+------+-----+

I tried this, but it did not work:

df = df1.groupBy("model","price").count().filter("count > 1")
df2 = df1.where((df.model == df1.model) & (df.price == df1.price))
df2.show()

It also includes the values that are not repeated:

+-----+------+-----+
|model|number|price|
+-----+------+-----+
|    a|     3|  4.2|
|    a|     7|  4.2|
|    b|     7|  2.6|
|    c|     7| 7.21|
|    c|    11| 7.21|
|    c|    18| 7.21|
|    d|    15|  9.0|
+-----+------+-----+

You can use a window function for this: partition by price, count the rows in each partition, and keep those where count > 1. (Your where attempt returns every row because df is derived from df1, so df.model == df1.model resolves to comparing each column with itself, which is always true.)

from pyspark.sql import Window
from pyspark.sql import functions as f

# Count the rows that share each price, then keep rows where that count exceeds 1
w = Window.partitionBy('price')

df1.withColumn('_c', f.count('price').over(w)).filter('_c > 1').drop('_c').show()

+-----+------+-----+
|model|number|price|
+-----+------+-----+
|    a|     3|  4.2|
|    a|     7|  4.2|
|    c|     7| 7.21|
|    c|    11| 7.21|
|    c|    18| 7.21|
+-----+------+-----+
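
As an alternative, the groupBy idea from the question also works if the where clause is replaced with a left semi join: build the set of duplicated prices first, then keep only the rows of df1 whose price is in that set. A minimal sketch (dup_prices is just an illustrative name):

# Prices that occur more than once
dup_prices = df1.groupBy('price').count().filter('count > 1').select('price')

# A left semi join keeps the rows of df1 whose price matches,
# without adding any columns from the right side
df1.join(dup_prices, on='price', how='left_semi') \
   .select('model', 'number', 'price') \
   .show()

The final select just restores the original column order, since a join on a named key column can move that column to the front of the result.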