Spark: check if there are at least n elements in a Dataset

Question

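I am using Spark (2.3.1) to do some processing on a dataset. For some reason, I would like to know whether my dataset contains enough data before running the computation.

The basic solution is the following: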

// Dataset.count() returns a long, so the result cannot be stored in an int
long count = myDataset.count();
int threshold = 100;

if (count > threshold) {
    // compute
} else {
    System.out.println("Not enough data to do computation");
}

But this is really inefficient, because it counts every row. A more efficient alternative is the countApprox() function:

// 1000 is the timeout in milliseconds, 0.90 the requested confidence
long count = (long) myDataset.rdd().countApprox(1000, 0.90).getFinalValue().mean();

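But for my use case, I think something more efficient still should be possible.

What is the best way to solve this?

Note:

I am considering iterating over my data, counting the rows manually and stopping as soon as the threshold is reached, but I am not sure that is the best solution.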
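
A minimal sketch of that early-stopping idea, written against a plain java.util.Iterator so it runs without a cluster. The class and method names (EarlyStopCount, hasMoreThan) are hypothetical. With Spark, the iterator could come from myDataset.toLocalIterator(), which brings rows to the driver, so this is an illustration of the logic rather than a recommended implementation:

```java
import java.util.Arrays;
import java.util.Iterator;

class EarlyStopCount {
    // Returns true as soon as more than `threshold` elements have been seen,
    // leaving the rest of the iterator unconsumed.
    static boolean hasMoreThan(Iterator<?> rows, long threshold) {
        if (threshold < 0) {
            return true; // any count, even zero, exceeds a negative threshold
        }
        long seen = 0;
        while (rows.hasNext()) {
            rows.next();
            seen++;
            if (seen > threshold) {
                return true; // stop early: enough data
            }
        }
        return false; // iterator exhausted before passing the threshold
    }

    public static void main(String[] args) {
        // Stand-in for myDataset.toLocalIterator() (assumption for illustration)
        Iterator<Integer> rows = Arrays.asList(1, 2, 3, 4, 5).iterator();
        System.out.println(hasMoreThan(rows, 3)); // prints "true"
    }
}
```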

Answer 1

Perhaps limit() would be more efficient:

df.limit(threshold).count()

Answer 2

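If you call myDataset.count(), it scans the full data and may be slow.

To speed this up, you can call limit(threshold+1) on the dataset. This returns another dataset with at most threshold+1 rows, on which you can then call .count().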


    int threshold = 100;
    // count() returns a long; the limit caps the result at threshold + 1
    long totalRowsAfterLimit = myDataset.limit(threshold + 1).count();

    if (totalRowsAfterLimit > threshold) {
        // compute
    } else {
        System.out.println("Not enough data to do computation");
    }

limit(threshold+1) ensures that the underlying job reads only a limited number of records, so it finishes faster.
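
The difference between limit(threshold) and limit(threshold+1) is easy to get wrong, so here is a small sketch of the two predicates. It is plain Java: the cappedCount argument is a hypothetical stand-in for n -> myDataset.limit(n).count(), and the class and method names are assumptions made for illustration only.

```java
import java.util.function.IntToLongFunction;

class LimitCheck {
    // "At least n rows": a count capped at n reaches n exactly when n or more rows exist.
    static boolean hasAtLeast(IntToLongFunction cappedCount, int n) {
        return cappedCount.applyAsLong(n) >= n;
    }

    // "More than threshold rows": cap at threshold + 1 so one extra row can be observed.
    static boolean hasMoreThan(IntToLongFunction cappedCount, int threshold) {
        return cappedCount.applyAsLong(threshold + 1) > threshold;
    }

    public static void main(String[] args) {
        long totalRows = 100; // pretend the dataset holds exactly 100 rows
        // Stand-in for: n -> myDataset.limit(n).count()
        IntToLongFunction cappedCount = n -> Math.min(n, totalRows);

        System.out.println(hasAtLeast(cappedCount, 100));  // prints "true"
        System.out.println(hasMoreThan(cappedCount, 100)); // prints "false"
    }
}
```

With exactly 100 rows, "at least 100" holds but "more than 100" does not, which is why the code above compares with > only after limiting to threshold+1.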

apache-spark

apache-spark-dataset