How do I pass a numeric feature with a large number of unique values to the Random Forest regression algorithm in PySpark MLlib?

Question

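I have a dataset with a numeric feature column that contains a large number of unique values (on the order of 10,000). I know that when we build a Random Forest regression model in PySpark, we pass a parameter maxBins, which should be at least equal to the maximum number of unique values across all features. So if I pass 10,000 as maxBins, the algorithm cannot handle the load: it either fails or runs forever. How can I pass such a feature to the model? I have read in a few places about binning the values into buckets and then passing those buckets to the model, but I don't know how to do this in PySpark. Can anyone show sample code for this? My current code is: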

    def parse(line):
        # line[6] and line[8] are feature columns with many unique values; line[12] is the numeric label
        return (line[1], line[3], line[4], line[5], line[6], line[8], line[11], line[12])


    # sc is the active SparkContext (predefined in the pyspark shell).
    # Tuple-unpacking lambdas are Python 2 only, so index into the (line, rownum) pair instead.
    # Note: rownum >= 0 keeps every row; use rownum > 0 to skip a header line.
    raw_lines = (sc.textFile('file1.csv')
        .zipWithIndex()
        .filter(lambda pair: pair[1] >= 0)
        .map(lambda pair: pair[0]))

    parsed_data = (raw_lines
        .map(lambda line: line.split(","))
        .filter(lambda line: len(line) > 1)
        .map(parse))


    # Split the input data into training and test sets with a 70%-30% ratio
    (train_data, test_data) = parsed_data.randomSplit([0.7, 0.3])

    label_col = "x7"


    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.sql.functions import col

    # Convert the RDD to a DataFrame. x4 and x5 are the columns with many unique values
    train_data_df = train_data.toDF(("x0","x1","x2","x3","x4","x5","x6","x7"))

    # Indexers encode strings as doubles
    string_indexers = [
        StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
        for x in train_data_df.columns if x != label_col
    ]

    # Assemble multiple columns into a single vector
    assembler = VectorAssembler(
        inputCols=["idx_{0}".format(x) for x in train_data_df.columns if x != label_col],
        outputCol="features"
    )


    pipeline = Pipeline(stages=string_indexers + [assembler])
    model = pipeline.fit(train_data_df)
    indexed = model.transform(train_data_df)

    # Go through .rdd: DataFrame.map is not available in Spark 2.0+.
    # toArray() yields a NumPy array, which LabeledPoint accepts.
    label_points = (indexed
        .select(col(label_col).cast("float").alias("label"), col("features"))
        .rdd
        .map(lambda row: LabeledPoint(row.label, row.features.toArray())))

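It would be helpful if someone could provide sample code showing how to modify the code above to bin the two numeric feature columns with many unique values.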

Answer 1

> we pass a parameter maxBins which should be at least equal to maximum unique value in all features.

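This is not true. It should be greater than or equal to the maximum number of categories among the categorical features. You still have to tune this parameter to get the desired performance, but otherwise there is nothing else to do here.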


python

binning

random-forest

apache-spark

pyspark