Spark random forest - could not convert float to int error

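I have numeric features and a binary response. I am trying to build ensemble decision tree models such as random forests and gradient-boosted trees, but I get an error. I have reproduced it with the iris data. The error is shown below, and the full error message is at the bottom.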

TypeError: Could not convert 12.631578947368421 to int

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
y = list(iris.target)
df = pd.read_csv("https://raw.githubusercontent.com/venky14/Machine-Learning-with-Iris-Dataset/master/Iris.csv")
df = df.drop(['Species'], axis = 1)
df['label'] = y
spark_df = spark.createDataFrame(df).drop('Id')
cols = spark_df.drop('label').columns
assembler = VectorAssembler(inputCols = cols, outputCol = 'features')
output_dat = assembler.transform(spark_df).select('label', 'features')

rf = RandomForestClassifier(labelCol = "label", featuresCol = "features")
paramGrid_rf = ParamGridBuilder() \
                     .addGrid(rf.maxDepth, np.linspace(5, 30, 6)) \
                     .addGrid(rf.numTrees, np.linspace(10, 60, 20)).build()

crossval_rf = CrossValidator(estimator = rf,
                         estimatorParamMaps = paramGrid_rf,
                         evaluator = BinaryClassificationEvaluator(),
                         numFolds = 5) 

cvModel_rf = crossval_rf.fit(output_dat)

TypeError                                 Traceback (most recent call last)
<ipython-input-24-44f8f759ed8e> in <module>
      2 paramGrid_rf = ParamGridBuilder() \
      3    .addGrid(rf.maxDepth, np.linspace(5, 30, 6)) \
----> 4    .addGrid(rf.numTrees, np.linspace(10, 60, 20)) \
      5    .build()
      6 

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in build(self)
    120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
--> 122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
    123 
    124 

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in <listcomp>(.0)
    120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
--> 122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
    123 
    124 

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in to_key_value_pairs(keys, values)
    118 
    119         def to_key_value_pairs(keys, values):
--> 120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
    122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in <listcomp>(.0)
    118 
    119         def to_key_value_pairs(keys, values):
--> 120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
    122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/param/__init__.py in toInt(value)
    197             return int(value)
    198         else:
--> 199             raise TypeError("Could not convert %s to int" % value)
    200 
    201     @staticmethod

TypeError: Could not convert 12.631578947368421 to int

maxDepth and numTrees both need to be integers; NumPy's linspace produces floats:

import numpy as np
np.linspace(10, 60, 20)

Result:

array([ 10.        ,  12.63157895,  15.26315789,  17.89473684,
        20.52631579,  23.15789474,  25.78947368,  28.42105263,
        31.05263158,  33.68421053,  36.31578947,  38.94736842,
        41.57894737,  44.21052632,  46.84210526,  49.47368421,
        52.10526316,  54.73684211,  57.36842105,  60.        ])

So your code fails at the first non-integer value it encounters (here 12.63157895) and raises the error.
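
To make the failure mode concrete, here is a minimal sketch in plain Python. The traceback only shows the branch that raises, not the condition guarding it, so `to_int_like` is a hypothetical stand-in rather than Spark's actual converter: it assumes that whole-valued numbers are accepted and anything with a fractional part is rejected. The grid is rebuilt by hand to approximate np.linspace(10, 60, 20).

```python
def to_int_like(value):
    """Hypothetical stand-in for an integer-param check (not Spark's code).

    Assumption: numbers with no fractional part are accepted and cast,
    everything else raises the same TypeError seen in the traceback.
    """
    if isinstance(value, (int, float)) and float(value).is_integer():
        return int(value)
    raise TypeError("Could not convert %s to int" % value)


# 20 evenly spaced points from 10 to 60, approximating np.linspace(10, 60, 20).
grid = [10 + i * 50 / 19 for i in range(20)]

accepted, rejected = [], []
for v in grid:
    try:
        accepted.append(to_int_like(v))
    except TypeError:
        rejected.append(v)

# Only the two endpoints are whole numbers; the first rejected value is the
# second grid point, which matches the value named in the error message.
first_rejected = rejected[0]
print(accepted, len(rejected), first_rejected)
```

Under that assumption, only the endpoints 10 and 60 survive, which is why the error names the second grid point and not the first.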

Use arange instead:

np.arange(10, 60, 20)
# array([10, 30, 50])
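
If you want to keep the "n evenly spaced points" behaviour of linspace rather than a fixed step, one option is to round and cast the values yourself. The sketch below uses only NumPy; the variable names are illustrative, and converting to plain Python ints with tolist() is a cautious choice on my part, not something Spark is known to require.

```python
import numpy as np

# Step-based grid: arange returns integers when start, stop and step are ints.
# The stop value is exclusive, so 61 is used to include 60.
num_trees_by_step = np.arange(10, 61, 10).tolist()

# Count-based grid: keep linspace's spacing, then round, cast to int and
# de-duplicate (rounding can map two nearby points to the same integer).
max_depth_grid = np.unique(np.linspace(5, 30, 6).round().astype(int)).tolist()
num_trees_by_count = np.unique(np.linspace(10, 60, 20).round().astype(int)).tolist()

print(num_trees_by_step)
print(max_depth_grid)
print(num_trees_by_count)
```

Either list can then be passed in place of the linspace call, for example .addGrid(rf.numTrees, num_trees_by_step). Keep in mind that the grid in the question has 6 x 20 = 120 combinations, each fitted 5 times by the cross-validator, so a coarser grid may also be worth considering.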