Scala Spark setting schema duplicates columns

I'm having trouble specifying the schema for my DataFrame. Without setting a schema, printSchema() produces:

root
 |-- Store: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- IsHoliday: string (nullable = true)
 |-- Dept: string (nullable = true)
 |-- Weekly_Sales: string (nullable = true)
 |-- Temperature: string (nullable = true)
 |-- Fuel_Price: string (nullable = true)
 |-- MarkDown1: string (nullable = true)
 |-- MarkDown2: string (nullable = true)
 |-- MarkDown3: string (nullable = true)
 |-- MarkDown4: string (nullable = true)
 |-- MarkDown5: string (nullable = true)
 |-- CPI: string (nullable = true)
 |-- Unemployment: string (nullable = true)

However, when I specify the schema with .schema(schema):

val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)

printSchema() then produces:

root
 |-- Store: integer (nullable = true)
 |-- Date: date (nullable = true)
 |-- IsHoliday: boolean (nullable = true)
 |-- Dept: integer (nullable = true)
 |-- Weekly_Sales: integer (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Fuel_Price: double (nullable = true)
 |-- MarkDown1: double (nullable = true)
 |-- MarkDown2: double (nullable = true)
 |-- MarkDown3: double (nullable = true)
 |-- MarkDown4: double (nullable = true)
 |-- MarkDown5: double (nullable = true)
 |-- CPI: double (nullable = true)
 |-- Unemployment: double (nullable = true)
 |-- Dept: integer (nullable = true)
 |-- Weekly_Sales: integer (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Fuel_Price: double (nullable = true)
 |-- MarkDown1: double (nullable = true)
 |-- MarkDown2: double (nullable = true)
 |-- MarkDown3: double (nullable = true)
 |-- MarkDown4: double (nullable = true)
 |-- MarkDown5: double (nullable = true)
 |-- CPI: double (nullable = true)
 |-- Unemployment: double (nullable = true)

The DataFrame itself ends up with all of these duplicated columns, and I'm not sure why.

My code:

import org.apache.spark.sql.types._

// Make custom schema
val schema = StructType(Array(
  StructField("Store", IntegerType, true),
  StructField("Date", DateType, true),
  StructField("IsHoliday", BooleanType, true),
  StructField("Dept", IntegerType, true),
  StructField("Weekly_Sales", IntegerType, true),
  StructField("Temperature", DoubleType, true),
  StructField("Fuel_Price", DoubleType, true),
  StructField("MarkDown1", DoubleType, true),
  StructField("MarkDown2", DoubleType, true),
  StructField("MarkDown3", DoubleType, true),
  StructField("MarkDown4", DoubleType, true),
  StructField("MarkDown5", DoubleType, true),
  StructField("CPI", DoubleType, true),
  StructField("Unemployment", DoubleType, true)))

val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
val train_df = dfr.load("/FileStore/tables/train.csv")
val features_df = dfr.load("/FileStore/tables/features.csv")

// Combine the train and features data
val data = train_df.join(features_df, Seq("Store", "Date", "IsHoliday"), "left")
data.show(5)
data.printSchema()

Everything is working as expected. Your train_df and features_df have the same schema (14 columns) after load().

In your join, Seq("Store", "Date", "IsHoliday") takes those 3 key columns from both DataFrames (3 + 3 = 6 columns in total) and merges them into a single set of 3 columns. The remaining columns, however, are kept from both sides: 11 more from train_df and 11 more from features_df.

That is why your printSchema() shows 25 columns (3 + 11 + 11).
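
One way to avoid the duplicates is to keep, on each side of the join, only the columns that logically belong to that file. Here is a minimal sketch; the split of the non-key columns between train.csv and features.csv is an assumption, so adjust the two lists to match what each file actually contains:

import org.apache.spark.sql.functions.col

val keys = Seq("Store", "Date", "IsHoliday")

// Hypothetical split of the non-key columns between the two files;
// adjust these lists to whatever each CSV actually holds.
val trainCols   = keys ++ Seq("Dept", "Weekly_Sales")
val featureCols = keys ++ Seq("Temperature", "Fuel_Price", "MarkDown1", "MarkDown2",
                              "MarkDown3", "MarkDown4", "MarkDown5", "CPI", "Unemployment")

// Select the relevant columns on each side before joining, so the
// result contains each column exactly once.
val dataDeduped = train_df.select(trainCols.map(col): _*)
  .join(features_df.select(featureCols.map(col): _*), keys, "left")

dataDeduped.printSchema()  // 3 keys + 2 train columns + 9 feature columns = 14 columns

The cleaner long-term fix is to give each file its own schema covering only the columns it actually contains, instead of applying the full 14-column schema to both reads.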