Combine two different csv files and make them into one
I want to merge csv1 and csv2 into final_csv, with a schema containing only String-type columns (file contents below):
csv1
emp_name designation salary_col
smith manager 40000
john analyst 35000
adam sr.engineer 50000
eve QA 36000
mills sr.manager 44000
csv2
emp_name designation advance_salary_col
smith manager 2000
john analyst 3030
adam sr.engineer 5044
eve QA 3600
mills sr.manager 4500
final_csv
emp_name designation salary_col advance_salary_col
smith manager 40000 2000
john analyst 35000 3030
adam sr.engineer 50000 5044
eve QA 36000 3600
mills sr.manager 44000 4500
I tried several approaches (union, intersect, unionByName), but in my final_df all the columns come out as null in Scala:
val emp_dataDf1 = spark.read.format("csv")
.option("header", "true")
.load("data/emp_data1.csv")
val emp_dataDf2 = spark.read.format("csv")
.option("header", "true")
.load("/data/emp_data2.csv")
val final_df= emp_dataDf1.union(emp_dataDf2)
This is a join. See the documentation on SQL joins and joins in Spark.
val final_df = emp_dataDf1.join(emp_dataDf2, Seq("emp_name", "designation"))
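If you want to follow the SQL-join documentation literally, the same join can also be expressed through temp views. A minimal sketch, assuming the two DataFrames above were read with the correct delimiter; the view names emp1 and emp2 are just illustrative:
// register the DataFrames as temp views so they can be queried with SQL
emp_dataDf1.createOrReplaceTempView("emp1")
emp_dataDf2.createOrReplaceTempView("emp2")
// inner join on the two key columns, keeping one copy of each key
val final_df = spark.sql(
  """SELECT e1.emp_name, e1.designation, e1.salary_col, e2.advance_salary_col
    |FROM emp1 e1
    |JOIN emp2 e2
    |  ON e1.emp_name = e2.emp_name AND e1.designation = e2.designation""".stripMargin)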
Note: you need to specify the delimiter of the csv files; if you use a space (or any other delimiter for that matter), make sure you use the exact string literal.
Here is how you can do it:
val csv1 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv1_file")
val csv2 = spark
.read
.option("header","true")
.option("sep", " ")
.csv("your_csv2_file")
val joinExpression = Seq("emp_name", "designation")
csv1.join(csv2, joinExpression, "inner").show(false)
/* output
+--------+-----------+----------+------------------+
|emp_name|designation|salary_col|advance_salary_col|
+--------+-----------+----------+------------------+
|smith |manager |40000 |2000 |
|john |analyst |35000 |3030 |
|adam |sr.engineer|50000 |5044 |
|eve |QA |36000 |3600 |
|mills |sr.manager |44000 |4500 |
+--------+-----------+----------+------------------+
*/
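If you also need to persist the joined result as the final_csv from the question, a minimal sketch (the output path is an assumption, and coalesce(1) just forces a single part file, which is fine at this data size):
val final_df = csv1.join(csv2, joinExpression, "inner")
final_df
  .coalesce(1)                      // single output part file; fine for data this small
  .write
  .option("header", "true")
  .option("sep", " ")
  .csv("final_csv")                 // hypothetical output directory; Spark writes part-*.csv files inside it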