Split RDD for K-fold validation: pyspark

Question

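I have a data set that I want to apply naive Bayes to, and I will validate it with K-fold cross-validation. My data has two classes and is ordered by class: if the data set has 100 rows, the first 50 belong to the first class and the next 50 to the second. So I first want to shuffle the data and then form the K folds at random. The problem is that when I call randomSplit on the RDD, it creates RDDs of different sizes. My code and a sample of the data set follow: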

documentDF = sqlContext.createDataFrame([
    (0,"This is a cat".lower().split(" "), ),
    (0,"This is a dog".lower().split(" "), ),
    (0,"This is a pig".lower().split(" "), ),
    (0,"This is a mouse".lower().split(" "), ),
    (0,"This is a donkey".lower().split(" "), ),
    (0,"This is a monkey".lower().split(" "), ),
    (0,"This is a horse".lower().split(" "), ),
    (0,"This is a goat".lower().split(" "), ),
    (0,"This is a tiger".lower().split(" "), ),
    (0,"This is a lion".lower().split(" "), ),
    (1,"A mouse and a pig are friends".lower().split(" "), ),
    (1,"A pig and a dog are friends".lower().split(" "), ),
    (1,"A mouse and a cat are friends".lower().split(" "), ),
    (1,"A lion and a tiger are friends".lower().split(" "), ),
    (1,"A lion and a goat are friends".lower().split(" "), ),
    (1,"A monkey and a goat are friends".lower().split(" "), ),
    (1,"A monkey and a donkey are friends".lower().split(" "), ),
    (1,"A horse and a donkey are friends".lower().split(" "), ),
    (1,"A horse and a tiger are friends".lower().split(" "), ),
    (1,"A cat and a dog are friends".lower().split(" "), )
], ["label","text"])

from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.regression import LabeledPoint

def mapper_vector(x):
    row = x.text
    return LabeledPoint(x.label,row)

splitSize = [0.2]*5
print("splitSize"+str(splitSize))
print(sum(splitSize))
# Since Spark 2.0 a PySpark DataFrame has no .map; go through the underlying RDD.
vect = documentDF.rdd.map(mapper_vector)
splits = vect.randomSplit(splitSize, seed=0)

print("***********SPLITS**************")
for i in range(len(splits)):
    print("split"+str(i)+":"+str(len(splits[i].collect())))

This code outputs:

splitSize[0.2, 0.2, 0.2, 0.2, 0.2]
1.0
***********SPLITS**************
split0:1
split1:5
split2:3
split3:5
split4:6

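documentDF has 20 rows, and I want 5 distinct, mutually exclusive samples of equal size from it. As the output shows, however, the splits have different sizes. What am I doing wrong?

Edit: According to zero323, I am not doing anything wrong. In that case, what do I need to change to get the result described above without using the ML CrossValidator? Also, why do the counts differ? If every split has the same weight, shouldn't each have the same number of rows? And is there another way to randomize the data?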

Answer 1

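You are not doing anything wrong. randomSplit simply gives no hard guarantees about how the data is distributed. It uses BernoulliCellSampler, and the exact fractions can differ from run to run. This is normal behavior and should be perfectly acceptable on any data set of realistic size, where the differences should be statistically insignificant.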
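To see why equal weights do not give equal counts, here is a minimal pure-Python sketch of the idea behind cell sampling. It is a simplified model, not Spark's actual implementation: the function name simulate_random_split is hypothetical, and the assumption is that every row gets one uniform draw and lands in whichever weight cell contains that draw.

```python
import bisect
import random
from itertools import accumulate


def simulate_random_split(n_rows, weights, seed):
    """Assign each row to a split with one uniform draw per row (simplified model)."""
    total = sum(weights)
    # Cumulative upper bounds of the cells, e.g. [0.2, 0.4, 0.6, 0.8, 1.0].
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in range(n_rows):
        x = rng.random()  # uniform in [0, 1)
        # The row goes to the cell [lower, upper) that contains x.
        splits[bisect.bisect_right(bounds, x)].append(row)
    return splits


# Sizes for 20 rows and five equal weights, under a few different seeds.
for seed in range(3):
    sizes = [len(s) for s in simulate_random_split(20, [0.2] * 5, seed)]
    print(seed, sizes)
```

Each row is placed independently, so the weights fix only the expected size of a split. With 20 rows the counts fluctuate noticeably; with many rows the relative fluctuation shrinks.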

On the other hand, Spark ML already provides CrossValidator, which can be used with ML Pipelines (see the example usage).
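If CrossValidator is not an option and the folds must have exactly equal sizes, one possible approach is to assign fold ids explicitly instead of sampling. The sketch below is an assumption-laden illustration rather than an established recipe: fold_ids and exact_k_folds are hypothetical helpers, and the list of ids is shipped inside the closure, which is only reasonable for small data sets like the 20-row example.

```python
import random


def fold_ids(n_rows, k, seed=0):
    """Return a shuffled list of fold ids whose counts differ by at most one."""
    ids = [i % k for i in range(n_rows)]  # 0, 1, ..., k-1, 0, 1, ...
    random.Random(seed).shuffle(ids)      # shuffling also mixes the ordered classes
    return ids


def exact_k_folds(rdd, k, seed=0):
    """Hypothetical helper: split an RDD into k folds of (near-)equal size."""
    ids = fold_ids(rdd.count(), k, seed)
    # zipWithIndex pairs every row with its position: (row, index).
    keyed = rdd.zipWithIndex().map(lambda pair: (ids[pair[1]], pair[0]))
    keyed.cache()  # the same keyed RDD is filtered k times
    return [keyed.filter(lambda kv, f=f: kv[0] == f).values() for f in range(k)]
```

With the question's data this would be called as exact_k_folds(vect, 5), giving five RDDs of 4 rows each. Fold i then serves as the test set, and the union of the remaining folds as the training set.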

python-3.x

apache-spark

pyspark

apache-spark-ml

apache-spark-mllib