Remove duplicates by line in Spark RDD
I'm doing some work with the PySpark MLlib FPGrowth algorithm and have an RDD in which each row (transaction) contains repeated items. These duplicates cause the model training function to throw an error. I'm new to Spark, and I'd like to know how to remove duplicates within each row of an RDD. For example:
#simple example
from pyspark.mllib.fpm import FPGrowth
data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)
model = FPGrowth.train(rdd, 0.6, 2)
freqit = model.freqItemsets()
freqit.collect()
so that it looks like this:
#simple example
from pyspark.mllib.fpm import FPGrowth
data_dedup = [["a", "b", "c"], ["a", "b", "d", "e"], ["a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data_dedup)
model = FPGrowth.train(rdd, 0.6, 2)
freqit = model.freqItemsets()
freqit.collect()
and runs without error.
Thanks in advance!
Use it like this:
rdd = rdd.map(lambda x: list(set(x)))
This will remove the duplicates within each row.
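A minimal local sketch of what that `map` call does, applied to plain Python lists rather than an RDD (no SparkContext needed), using the question's sample data:

```python
# The same per-row function that rdd.map() would apply to each transaction.
# Note: set() does not preserve element order; FPGrowth treats each row as
# an itemset, so order does not matter for training.
def dedup_row(x):
    return list(set(x))

data = [["a", "a", "b", "c"], ["a", "b", "d", "e"],
        ["a", "a", "c", "e"], ["a", "c", "f"]]

data_dedup = [dedup_row(row) for row in data]
print([sorted(row) for row in data_dedup])
# [['a', 'b', 'c'], ['a', 'b', 'd', 'e'], ['a', 'c', 'e'], ['a', 'c', 'f']]
```

If you want a deterministic element order within each row, use `sorted(set(x))` instead of `list(set(x))`.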