What is the difference between running a PySpark program with and without a cluster?

Question

I have a program in which a few lines use PySpark functions (the rest is plain Python).

The part of my code that uses PySpark:

import time

import numpy as np
from pyspark.mllib.linalg.distributed import RowMatrix

# Assumes X is an existing pandas DataFrame and sc is an existing SparkContext.
X.to_csv(r'first.txt', header=None, index=None, sep=' ', mode='a')

# load the dataset
rows = np.loadtxt('first.txt')

# distribute the rows as an RDD
rows = sc.parallelize(rows)

mat = RowMatrix(rows)
start_time = time.time()  # to measure the execution time of the call below

# compute SVD
svd = mat.computeSVD(20, computeU=True)

exemple_one = time.time() - start_time
print("---Example one : %s seconds ---" % (exemple_one))

first.txt is a text file containing a 2346x27 matrix:

0.0 0.0 ... 0.0 0.0 0.06664409020350408 0.0 0.0 0.0 0.0 0.0 .... 0 0.0 0.0

What are the differences between running my program on a cluster (with YARN) and running it on my own machine (with the python command)?

Answer 1

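There is no difference in the results you will get.
Depending on your workload, you may run into resource problems (memory, CPU) when running locally.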
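
To get a rough feel for whether the matrix from the question would strain a single machine, here is a back-of-the-envelope estimate of its in-memory size. It is only a sketch: it assumes every entry is stored as a 64-bit float (the default dtype of np.loadtxt) and ignores Spark's own overhead. The function name dense_matrix_bytes is made up for this illustration.

```python
# Rough estimate of the memory needed to hold the matrix as a dense array.
# Assumption: 8 bytes per entry (float64); Spark/JVM overhead is not counted.
ROWS, COLS = 2346, 27          # shape of first.txt, taken from the question
BYTES_PER_FLOAT64 = 8


def dense_matrix_bytes(rows: int, cols: int, itemsize: int = BYTES_PER_FLOAT64) -> int:
    """Return the number of bytes needed for a dense rows x cols matrix."""
    return rows * cols * itemsize


size_bytes = dense_matrix_bytes(ROWS, COLS)
print(f"{size_bytes} bytes, about {size_bytes / 1024**2:.2f} MiB")
```

Under these assumptions the data is roughly half a megabyte, so for this particular input a local run is unlikely to hit memory limits; the picture changes as the matrix grows.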

Spark lets you use a resource manager (such as YARN), so that your application can scale out by acquiring executors from that resource manager.
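
As a minimal sketch of what this means in code, the only thing that has to change between a local run and a YARN run is the master URL given to Spark; the rest of the program stays the same. The helper names pick_master and make_context, the --yarn flag, and the app name are hypothetical and exist only for this illustration. In practice the master is often passed through spark-submit's --master option instead of being set in code, and using "yarn" assumes the Hadoop/YARN client configuration is available on the machine.

```python
import sys


def pick_master(argv):
    """Return 'yarn' if the (hypothetical) --yarn flag is present, else local mode."""
    # "local[*]" runs Spark inside one process, using all local cores.
    return "yarn" if "--yarn" in argv else "local[*]"


def make_context(master, app_name="svd-example"):
    """Create (or reuse) a SparkContext bound to the given master."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master).setAppName(app_name)
    return SparkContext.getOrCreate(conf)


# Usage (requires pyspark; "yarn" also needs a configured YARN client):
#   sc = make_context(pick_master(sys.argv[1:]))
#   rows = sc.parallelize(...)   # same code as in the question from here on
```

With "local[*]" all work happens on your own machine; with "yarn" the same calls are executed by executors that YARN allocates on the cluster.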

Take a look at the official Spark documentation and see whether you have more specific questions.


Tags: python, cluster-computing, apache-spark, pyspark
