
PySpark: getting only the minimum values

I want to get only the minimum values.

import pyspark as ps

spark = ps.sql.SparkSession.builder.master('local[4]')\
    .appName('some-name-here').getOrCreate()

sc = spark.sparkContext

sc.textFile('path-to.csv')\
    .map(lambda x: x.replace('"', '').split(','))\
    .filter(lambda x: not x[0].startswith('player_id'))\
    .map(lambda x: (x[2] + " " + x[1], int(x[8]) if x[8] else 0))\
    .reduceByKey(lambda value1, value2: value1 + value2)\
    .sortBy(lambda price: price[1], ascending=True).collect()

This is what I get:

[('Cedric Ceballos', 0), ('Maurcie Cheeks', 0), ('James Foster', 0), ('Billy Gabor', 0), ('Julius Keye', 0), ('Anthony Mason', 0), ('Chuck Noble', 0), ('Theo Ratliff', 0), ('Austin Carr', 0), ('Mark Eaton', 0), ('A.C. Green', 0), ('Darrall Imhoff', 0), ('John Johnson', 0), ('Neil Johnson', 0), ('Jim King', 0), ('Max Zaslofsky', 1), ('Don Barksdale', 1), ('Curtis Rowe', 1), ('Caron Butler', 2), ('Chris Gatling', 2)].

As you can see, there are many keys whose value is 0, which is the minimum. How can I get only those?

You can collect the minimum value into a variable and then filter for equality against it:

rdd = (sc.textFile('path-to.csv')
    # strip quotes and split each CSV line into fields
    .map(lambda x: x.replace('"', '').split(','))
    # drop the header row
    .filter(lambda x: not x[0].startswith('player_id'))
    # key: "first last" name; value: column 8 as int (0 if empty)
    .map(lambda x: (x[2] + " " + x[1], int(x[8]) if x[8] else 0))
    # sum the values per player
    .reduceByKey(lambda value1, value2: value1 + value2)
    # sort ascending so the minimum comes first
    .sortBy(lambda pair: pair[1], ascending=True))

# the first element's value is the minimum
minval = rdd.take(1)[0][1]
# keep only the pairs whose value equals the minimum
rdd2 = rdd.filter(lambda x: x[1] == minval)

Your data is already sorted, so take(1) (rather than collect()) retrieves just the first element, which holds the minimum value; the filter then keeps every key whose total equals it.
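The same logic can be sketched in plain Python, with no Spark required, to show what the pipeline does: sum the values per key, find the minimum total, then keep only the keys at that minimum. The sample rows here are made up for illustration:

```python
from collections import defaultdict

# Hypothetical sample rows: (player name, value) pairs, standing in
# for the parsed CSV records fed into the RDD pipeline above.
rows = [("A.C. Green", 0), ("Max Zaslofsky", 1),
        ("A.C. Green", 0), ("Caron Butler", 2)]

# reduceByKey equivalent: sum the values per key.
totals = defaultdict(int)
for name, value in rows:
    totals[name] += value

# take(1)-on-the-sorted-RDD equivalent: the smallest total.
minval = min(totals.values())

# filter equivalent: keep only the keys at the minimum.
result = {name: total for name, total in totals.items() if total == minval}
print(result)  # {'A.C. Green': 0}
```

The equality filter is what distinguishes this from a plain sort: sorting puts the minimum first, but only the filter discards everything above it.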