How to pass a numeric feature with a large number of unique values to the Random Forest regression algorithm in PySpark MLlib?

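I have a dataset with a numeric feature column that contains a large number of unique values (on the order of 10,000). I know that when we build a model for the Random Forest regression algorithm in PySpark, we pass a parameter maxBins which should be at least equal to maximum unique value in all features. So if I pass 10,000 as the maxBins value, the algorithm cannot handle the load: it either fails or runs forever. How can I pass such a feature to the model? I have read in a few places about binning the values into buckets and then passing those buckets to the model, but I don't know how to do that in PySpark. Can anyone show sample code for this? My current code is: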

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.sql.functions import col


    def parse(line):
        # line[6] and line[8] are feature columns with many unique values; line[12] is the numeric label
        return (line[1], line[3], line[4], line[5], line[6], line[8], line[11], line[12])


    # sc is the existing SparkContext. The rownum >= 0 filter keeps every row.
    input = (sc.textFile('file1.csv')
        .zipWithIndex()
        .filter(lambda pair: pair[1] >= 0)
        .map(lambda pair: pair[0]))

    parsed_data = (input
        .map(lambda line: line.split(","))
        .filter(lambda line: len(line) > 1)
        .map(parse))

    # Split the input data into training and test sets with a 70%-30% ratio
    (train_data, test_data) = parsed_data.randomSplit([0.7, 0.3])

    label_col = "x7"

    # Convert the RDD to a DataFrame. x4 and x5 are the columns with many unique values
    train_data_df = train_data.toDF(("x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7"))

    # Indexers encode strings as doubles
    string_indexers = [
        StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
        for x in train_data_df.columns if x != label_col
    ]

    # Assemble multiple columns into a single vector
    assembler = VectorAssembler(
        inputCols=["idx_{0}".format(x) for x in train_data_df.columns if x != label_col],
        outputCol="features"
    )

    pipeline = Pipeline(stages=string_indexers + [assembler])
    model = pipeline.fit(train_data_df)
    indexed = model.transform(train_data_df)

    # Map over the underlying RDD of Rows to build LabeledPoint objects
    label_points = (indexed
        .select(col(label_col).cast("float").alias("label"), col("features"))
        .rdd
        .map(lambda row: LabeledPoint(row.label, row.features)))

It would be helpful if someone could provide sample code showing how to modify the code above to bin the two large numeric feature columns.

> we pass a parameter maxBins which should be at least equal to maximum unique value in all features.

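This is not true. `maxBins` has to be greater than or equal to the maximum number of categories among the categorical features. For continuous features it only sets how many bins the values are discretized into when candidate splits are searched, so a numeric column with 10,000 distinct values can be passed as it is. You still have to tune this parameter to get the desired performance, but otherwise there is nothing else to do here.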
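To make the point about continuous features concrete, here is a minimal pure-Python sketch of quantile-style discretization: a column with 10,000 unique values is reduced to at most `max_bins - 1` candidate split thresholds. The helper names `candidate_thresholds` and `bin_index` are hypothetical, and this is only an illustration of the idea; Spark's actual split-finding differs in detail.

```python
import bisect
import random


def candidate_thresholds(values, max_bins):
    """Return at most max_bins - 1 split thresholds (hypothetical helper)."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        # Few distinct values: every midpoint between neighbours is a candidate.
        return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    ordered = sorted(values)
    n = len(ordered)
    thresholds = []
    for i in range(1, max_bins):
        # Take the value sitting at the i-th of max_bins evenly spaced quantiles.
        t = ordered[i * n // max_bins]
        if not thresholds or t > thresholds[-1]:
            thresholds.append(t)
    return thresholds


def bin_index(value, thresholds):
    """Index of the bin a value falls into, from 0 to len(thresholds)."""
    return bisect.bisect_right(thresholds, value)


random.seed(0)
feature = random.sample(range(1000000), 10000)  # 10,000 unique numeric values
thresholds = candidate_thresholds(feature, max_bins=32)
bins_used = {bin_index(v, thresholds) for v in feature}

print(len(set(feature)), len(thresholds), len(bins_used))  # prints: 10000 31 32
```

In PySpark's RDD-based API, the practical consequence is to list only the truly categorical columns in `categoricalFeaturesInfo` when calling `RandomForest.trainRegressor` and to leave the large numeric columns out, so that they are treated as continuous; `maxBins` can then stay near its default of 32. Assuming `x4` and `x5` in the question really are numeric, they would be cast to a numeric type rather than run through `StringIndexer`.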