Is there any train_test_split in pyspark or MLLib?

Question

Is there a pyspark / MLLib equivalent of the classic sklearn train_test_split code below?

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(featuresonly, 
                                                    target, 
                                                    test_size = 0.2, 
                                                    random_state = 123)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
print("Training set has good {} samples.".format(len(y_train) -y_train.sum()))
print("Testing set has good {} samples.".format(len(y_test) -y_test.sum()))

Answer 1

According to the documentation, it is called randomSplit.
There are plenty of examples here showing how to use it.

Answer 2

randomSplit, as mentioned above, is the way to go:

train, test = final_data.randomSplit([0.7,0.3], seed=4000)

Then you can count the labels in your training set:

# Total number of rows in the training set
dataset_size = train.select("label").count()
# Rows whose label is 1
Positives = train.select("label").where('label == 1').count()
# Share of ones, as a percentage
percentage_ones = (float(Positives) / float(dataset_size)) * 100
# Everything that is not a one is counted as a zero
Negatives = dataset_size - Positives

print('The number of ones is {}'.format(Positives))
print('The percentage of ones is {}'.format(percentage_ones))
print('The number of zeroes is {}'.format(Negatives))

python

dataframe

pyspark

apache-spark-mllib