Repeated values in pyspark
I have a dataframe in pyspark with three columns:
df1 = spark.createDataFrame([
('a', 3, 4.2),
('a', 7, 4.2),
('b', 7, 2.6),
('c', 7, 7.21),
('c', 11, 7.21),
('c', 18, 7.21),
('d', 15, 9.0),
], ['model', 'number', 'price'])
df1.show()
+-----+------+-----+
|model|number|price|
+-----+------+-----+
| a| 3| 4.2|
| a| 7| 4.2|
| b| 7| 2.6|
| c| 7| 7.21|
| c| 11| 7.21|
| c| 18| 7.21|
| d| 15| 9.0|
+-----+------+-----+
Is there a way in pyspark to show only the rows whose value in the 'price' column is duplicated?
Like df2:
df2 = spark.createDataFrame([
('a', 3, 4.2),
('a', 7, 4.2),
('c', 7, 7.21),
('c', 11, 7.21),
('c', 18, 7.21),
], ['model', 'number', 'price'])
df2.show()
+-----+------+-----+
|model|number|price|
+-----+------+-----+
| a| 3| 4.2|
| a| 7| 4.2|
| c| 7| 7.21|
| c| 11| 7.21|
| c| 18| 7.21|
+-----+------+-----+
I tried this, but it didn't work:
df = df1.groupBy("model","price").count().filter("count > 1")
df2 = df1.where((df.model == df1.model) & (df.price == df1.price))
df2.show()
It also includes the non-duplicated values:
+-----+------+-----+
|model|number|price|
+-----+------+-----+
| a| 3| 4.2|
| a| 7| 4.2|
| b| 7| 2.6|
| c| 7| 7.21|
| c| 11| 7.21|
| c| 18| 7.21|
| d| 15| 9.0|
+-----+------+-----+
You can do it with a window function: partition by 'price', count the rows in each partition, and filter for count > 1:
from pyspark.sql import Window
from pyspark.sql import functions as f

# count the rows sharing each price; keep only rows whose price occurs more than once
w = Window().partitionBy('price')
df1.withColumn('_c', f.count('price').over(w)).filter('_c > 1').drop('_c').show()
+-----+------+-----+
|model|number|price|
+-----+------+-----+
| a| 3| 4.2|
| a| 7| 4.2|
| c| 7| 7.21|
| c| 11| 7.21|
| c| 18| 7.21|
+-----+------+-----+
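For reference, the original attempt most likely fails because df is derived from df1, so the predicate (df.model == df1.model) & (df.price == df1.price) resolves both sides to the same columns of df1 and is always true. Also, since duplicates are defined by 'price' alone, grouping by 'price' alone matches the intent better than grouping by ("model", "price"). A join against the grouped counts expresses what was attempted; here is a minimal sketch, assuming the same df1 and Spark session as above (dup_prices is just an illustrative name):

# prices that occur more than once
dup_prices = df1.groupBy('price').count().filter('count > 1')

# keep only the rows of df1 whose price appears in dup_prices;
# a left-semi join returns df1's columns only
df1.join(dup_prices.select('price'), on='price', how='left_semi') \
    .select('model', 'number', 'price').show()

This returns the same five rows as the window version.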