How to find out the total size of the data read and which data belongs to which node in Spark

Question

Suppose I am reading a dataset like this with Apache Spark:

City | Region | Population
A    | A1     | 150000
A    | A2     | 50000
B    | B1     | 250000
C    | C1     | 350000

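After creating a DataFrame from this, suppose I repartition it by City. Now, if I want to know which node of my Spark cluster holds the information for City A, is it possible to find out? If so, please advise on how.

One more question: how can I find out the total size of the data that Spark read into the DataFrame?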

Answer 1

There are a couple of questions here.

1. You want to see what data each node is processing

Here, the executor nodes only apply the operations defined in the RDD or DataFrame transformations to the chunk of data held in the partitions on that executor node.
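To illustrate the point about partitions, here is a minimal PySpark sketch (not part of the original answer) that tags each row with the partition it lands in after repartitioning by City. `spark_partition_id` is a built-in Spark SQL function; the helper `cities_by_partition`, the function `show_partition_layout` and the app name are hypothetical names chosen for this example. Note that a partition id by itself does not name a node.

```python
from collections import defaultdict


def cities_by_partition(pairs):
    """Group (partition_id, city) pairs into {partition_id: sorted city names}."""
    grouped = defaultdict(set)
    for partition_id, city in pairs:
        grouped[partition_id].add(city)
    return {pid: sorted(cities) for pid, cities in grouped.items()}


def show_partition_layout():
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import spark_partition_id

    spark = SparkSession.builder.appName("partition-layout").getOrCreate()
    df = spark.createDataFrame(
        [("A", "A1", 150000), ("A", "A2", 50000),
         ("B", "B1", 250000), ("C", "C1", 350000)],
        ["City", "Region", "Population"],
    )
    # Hash-partition by City, then tag every row with the partition it landed in.
    tagged = df.repartition("City").withColumn("partition_id", spark_partition_id())
    pairs = [(row["partition_id"], row["City"]) for row in tagged.collect()]
    for pid, cities in sorted(cities_by_partition(pairs).items()):
        print(f"partition {pid}: {cities}")
    spark.stop()
```

Calling `show_partition_layout()` on a machine with PySpark installed prints one line per non-empty partition. The stage page of the Spark UI lists the executor and host that ran each task, which is one way to map a partition id back to a node.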

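I think the best way to inspect the data inside a node is probably to enable logging for the driver and the executors and to write log entries inside the RDD/DataFrame operations. These logs may be written to each executor's local disk, in which case you need to connect to every executor node to verify which data belongs to which node.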
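The logging idea could look roughly like the following PySpark sketch. It is an illustration under assumptions, not the answerer's code: `log_partition`, `audit_city_placement`, the logger name and the app name are hypothetical, and where the log line ends up depends on how the cluster collects executor logs. The placement it reports only reflects that particular run, since partitions can be recomputed on a different executor.

```python
import logging
import socket


def log_partition(index, rows):
    """Log which host processes this partition and which cities it holds."""
    cities = sorted({row["City"] for row in rows})
    if not cities:
        return  # skip empty partitions
    host = socket.gethostname()
    # On a cluster this runs on the executor, so the entry goes to its log.
    logging.getLogger("partition_audit").warning(
        "host=%s partition=%d cities=%s", host, index, cities
    )
    yield (host, index, cities)


def audit_city_placement():
    # Imported here so log_partition stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-audit").getOrCreate()
    df = spark.createDataFrame(
        [("A", "A1", 150000), ("A", "A2", 50000),
         ("B", "B1", 250000), ("C", "C1", 350000)],
        ["City", "Region", "Population"],
    )
    # Run log_partition once per partition and bring the small summary back.
    placement = (
        df.repartition("City").rdd.mapPartitionsWithIndex(log_partition).collect()
    )
    for host, index, cities in placement:
        if "A" in cities:
            print(f"City A is in partition {index} on host {host}")
    spark.stop()
```

Because the summary is also collected to the driver, calling `audit_city_placement()` shows the host for City A without having to log in to each executor, while the log entries remain available on the nodes for later checks.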

If you want to know the total size of the data read into the DataFrame, please refer to the following:

How to find out the total size of data read and which data is belonging to which node in Spark
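As a rough illustration of the size question, the sketch below sums the on-disk size of the files behind a DataFrame. It assumes the input lives on a local filesystem and uses `DataFrame.inputFiles()`, which returns a best-effort list of source files; `local_input_size`, `report_input_size` and the app name are hypothetical names. It measures the source files, not the in-memory size, and for HDFS the sizes would have to come from the Hadoop filesystem instead of `os.path.getsize`.

```python
import os
from urllib.parse import unquote, urlparse


def local_input_size(paths):
    """Sum the size in bytes of local files given as plain paths or file: URIs."""
    total = 0
    for p in paths:
        parsed = urlparse(p)
        # inputFiles() reports URIs such as file:///data/part-0000.csv
        local_path = unquote(parsed.path) if parsed.scheme == "file" else p
        total += os.path.getsize(local_path)
    return total


def report_input_size(path):
    # Imported here so local_input_size stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("input-size").getOrCreate()
    df = spark.read.csv(path, header=True)
    print(f"{local_input_size(df.inputFiles())} bytes read from {path}")
    spark.stop()
```

Calling `report_input_size` with the path of a local CSV file or directory prints the total number of bytes in the files Spark would read for that DataFrame.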

hadoop

hdfs

hadoop-yarn

apache-spark

apache-spark-sql