pyspark: remove rows that have duplicated values for a given field
I have the following dataframe:
field_A | field_B | field_C | field_D
cat | 12 | black | 11
dog | 128 | white | 19
dog | 35 | yellow | 20
dog | 21 | brown | 4
bird | 10 | blue | 7
cow | 99 | brown | 34
Is it possible to filter out the rows whose field_A value appears more than once? That is, I would like the final dataframe to be:
field_A | field_B | field_C | field_D
cat | 12 | black | 11
bird | 10 | blue | 7
cow | 99 | brown | 34
Is this doable in pyspark? Thanks!
Create the data
rdd = sc.parallelize([(0,1), (0,10), (0,20), (1,2), (2,1), (3,5), (3,18), (4,15), (5,18)])
t = sqlContext.createDataFrame(rdd, ["id", "score"])
t.collect()
[Row(id=0, score=1),
Row(id=0, score=10),
Row(id=0, score=20),
Row(id=1, score=2),
Row(id=2, score=1),
Row(id=3, score=5),
Row(id=3, score=18),
Row(id=4, score=15),
Row(id=5, score=18)]
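As a side note, on Spark 2.x and later the same sample dataframe is usually built through a SparkSession rather than sc / sqlContext; a minimal sketch, assuming a SparkSession can be created the usual way:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# same (id, score) pairs as above, passed straight to createDataFrame
t = spark.createDataFrame(
    [(0, 1), (0, 10), (0, 20), (1, 2), (2, 1), (3, 5), (3, 18), (4, 15), (5, 18)],
    ["id", "score"])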
Count the number of rows for each id
idCounts = t.groupBy('id').count()
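At this point idCounts holds one row per distinct id together with how many times it occurs. For the sample data above it should contain roughly the following (row order may vary):

idCounts.collect()
[Row(id=0, count=3), Row(id=1, count=1), Row(id=2, count=1), Row(id=3, count=2), Row(id=4, count=1), Row(id=5, count=1)]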
Join idCounts back to the original dataframe and keep only the rows whose id appears exactly once
out = t.join(idCounts,'id','left_outer').filter('count = 1').select(['id', 'score'])
out.collect()
[Row(id=1, score=2),
Row(id=2, score=1),
Row(id=4, score=15),
Row(id=5, score=18)]
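The same groupBy-count-join pattern answers the original field_A question; a minimal sketch, assuming the dataframe from the question is loaded under the hypothetical name df:

counts = df.groupBy('field_A').count()
# keep only rows whose field_A occurs exactly once, then drop the helper column
result = df.join(counts, 'field_A', 'left_outer').filter('count = 1').drop('count')
result.show()

An alternative that avoids the explicit join is a window count partitioned by field_A:

from pyspark.sql import Window, functions as F

w = Window.partitionBy('field_A')
result = df.withColumn('cnt', F.count('*').over(w)).filter('cnt = 1').drop('cnt')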