Is there a train_test_split in PySpark or MLlib?
Is there a PySpark / MLlib version of the classic sklearn train_test_split code below?
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(featuresonly,
                                                    target,
                                                    test_size=0.2,
                                                    random_state=123)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
print("Training set has {} good samples.".format(len(y_train) - y_train.sum()))
print("Testing set has {} good samples.".format(len(y_test) - y_test.sum()))
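According to the documentation, it is called randomSplit.
There are plenty of examples here showing how to use it.
randomSplit, as mentioned above, is the way to go: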
train, test = final_data.randomSplit([0.7,0.3], seed=4000)
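One difference from sklearn is worth knowing: the weights passed to randomSplit are normalized if they do not sum to 1.0, and the resulting split sizes are only approximately 70/30 rather than exact. The plain-Python sketch below illustrates why, on the assumption that each row is assigned by an independent random draw against the cumulative weights. The function `random_split` and its arguments are hypothetical names for illustration, not Spark's implementation or API.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by an independent uniform random draw."""
    total = float(sum(weights))
    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding in the last bound

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = random_split(list(range(1000)), [0.7, 0.3], seed=4000)
# The sizes are close to 700 and 300, but usually not exactly those numbers.
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one split, so the two parts always add up to the full dataset. If you need exact sizes, as with sklearn's test_size, check the counts after splitting.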
Then you can count the labels in your training set:
# Total number of rows in the training split
dataset_size = train.count()
# Rows whose label is 1
positives = train.where("label == 1").count()
# Everything else (assumes a binary 0/1 label)
negatives = dataset_size - positives
percentage_ones = 100.0 * positives / dataset_size
print('The number of ones is {}'.format(positives))
print('Percentage of ones is {}'.format(percentage_ones))
print('The number of zeroes is {}'.format(negatives))