Best Practice when Writing PySpark DataFrames to MySQL

df_tsv = spark.read.csv(tsv_file, sep=r'\t', header=True)
df_tsv.write.jdbc(
    url=mysql_url,
    table=mysql_table,
    mode="append",
    properties={
        "user": mysql_user,
        "password": mysql_password,
        "driver": "com.mysql.cj.jdbc.Driver",
    },
)


java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver

The first thing that I want to know is how I can solve the above issue.

You need to pass the MySQL JDBC connector when starting the Spark session, so that the driver class is on the classpath. See https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
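
As a minimal sketch of what "passing the connector" can look like from PySpark: the snippet below sets `spark.jars.packages` so Spark downloads MySQL Connector/J before the session starts. The connector version, the app name, and the helper names `connector_conf` and `build_session` are assumptions for illustration, not something taken from the question.

```python
# Maven coordinates of MySQL Connector/J. The version is an assumption;
# pick the one that matches your MySQL server.
MYSQL_CONNECTOR = "com.mysql:mysql-connector-j:8.0.33"


def connector_conf(package=MYSQL_CONNECTOR):
    """Spark settings that put the JDBC driver on the classpath."""
    return {"spark.jars.packages": package}


def build_session(app_name="tsv-to-mysql"):
    """Create a SparkSession with the connector configured (hypothetical helper)."""
    # Imported lazily so connector_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in connector_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The setting only takes effect if it is applied before the session is created. With spark-submit, the equivalent is the `--packages` flag with the same coordinates, or `--jars` with a path to a local connector JAR.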

Secondly, I would like to know what the best practice is when writing data from Spark to databases like MySQL. For instance, is there an option to make it so that data from a given column in the DataFrame is stored in a specified column in the table? Or should the column names of the table be the same as those of the DataFrame?

Yes, the DataFrame column names will be matched against the table column names.
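
If the DataFrame columns are named differently from the table columns, one option is to rename them before writing. The following is a rough sketch using `selectExpr`; the column names in `column_map` and the helper names are hypothetical.

```python
def select_exprs(column_map):
    """Build selectExpr strings that rename DataFrame columns to table columns."""
    return [f"`{src}` AS `{dst}`" for src, dst in column_map.items()]


def align_to_table(df, column_map):
    """Keep only the mapped columns, renamed to the table's column names."""
    return df.selectExpr(*select_exprs(column_map))


# Hypothetical mapping: DataFrame column -> MySQL table column
column_map = {"Name": "customer_name", "Signup Date": "signup_date"}
print(select_exprs(column_map))
```

Calling `align_to_table(df_tsv, column_map)` before `write.jdbc(...)` would then produce a DataFrame whose column names line up with the target table.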

The other option that I can think of here is to convert the DataFrame to, say, a list of tuples and then use something like the mysql-python-connector to load the data into the database:

rdd = df.rdd

b = rdd.map(tuple)

data = b.collect()

# write data to database using mysql-python-connector

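No, never do this. It defeats the whole purpose of using Spark (distributed computing). Have a look at the link above; you will find some good advice on where to start and how to read from and write to JDBC data sources.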
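
For comparison with the `collect()` approach, here is a sketch of a write that stays distributed: each partition inserts its own rows in batches through the JDBC data source. `batchsize` and `numPartitions` are documented JDBC options, but the values used here and the helper names are only illustrative assumptions that would need tuning for a real database.

```python
def jdbc_write_options(url, table, user, password,
                       batch_size=10000, num_partitions=8):
    """Options for Spark's JDBC writer (values are illustrative assumptions)."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",
        # Rows sent per JDBC batch insert.
        "batchsize": str(batch_size),
        # Upper bound on parallel connections opened against MySQL.
        "numPartitions": str(num_partitions),
    }


def write_to_mysql(df, options):
    """Append the DataFrame to the table; executors write their own partitions."""
    df.write.format("jdbc").options(**options).mode("append").save()
```

Nothing is pulled back to the driver here, so the size of the DataFrame is not limited by driver memory.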