Scala Spark setting schema duplicates columns
I'm having trouble specifying a schema for my DataFrame. Without setting the schema, printSchema() produces:
root
|-- Store: string (nullable = true)
|-- Date: string (nullable = true)
|-- IsHoliday: string (nullable = true)
|-- Dept: string (nullable = true)
|-- Weekly_Sales: string (nullable = true)
|-- Temperature: string (nullable = true)
|-- Fuel_Price: string (nullable = true)
|-- MarkDown1: string (nullable = true)
|-- MarkDown2: string (nullable = true)
|-- MarkDown3: string (nullable = true)
|-- MarkDown4: string (nullable = true)
|-- MarkDown5: string (nullable = true)
|-- CPI: string (nullable = true)
|-- Unemployment: string (nullable = true)
However, when I specify the schema with .schema(schema)
val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
my printSchema() produces:
root
|-- Store: integer (nullable = true)
|-- Date: date (nullable = true)
|-- IsHoliday: boolean (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
The DataFrame itself has all of these duplicate columns, and I'm not sure why.
My code:
// Make custom schema
val schema = StructType(Array(
  StructField("Store", IntegerType, true),
  StructField("Date", DateType, true),
  StructField("IsHoliday", BooleanType, true),
  StructField("Dept", IntegerType, true),
  StructField("Weekly_Sales", IntegerType, true),
  StructField("Temperature", DoubleType, true),
  StructField("Fuel_Price", DoubleType, true),
  StructField("MarkDown1", DoubleType, true),
  StructField("MarkDown2", DoubleType, true),
  StructField("MarkDown3", DoubleType, true),
  StructField("MarkDown4", DoubleType, true),
  StructField("MarkDown5", DoubleType, true),
  StructField("CPI", DoubleType, true),
  StructField("Unemployment", DoubleType, true)))
val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
val train_df = dfr.load("/FileStore/tables/train.csv")
val features_df = dfr.load("/FileStore/tables/features.csv")
// Combine the train and features
val data = train_df.join(features_df, Seq("Store", "Date", "IsHoliday"), "left")
data.show(5)
data.printSchema()
Everything is working as expected. After load(), your train_df and features_df have exactly the same columns as the schema (14 columns each).
Your join condition Seq("Store", "Date", "IsHoliday") takes these 3 columns from both DataFrames (3 + 3 = 6 in total), joins on them, and emits a single set of key columns (3 columns). The remaining columns, however, come from both sides: the other 11 columns of train_df and the other 11 columns of features_df.
That is why your printSchema shows 25 columns (3 + 11 + 11).
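To see this in isolation, here is a minimal, self-contained sketch (the tiny DataFrames and column names are made up purely for illustration, and it assumes a SparkSession; the same behaviour holds with the older sqlContext): joining with a Seq of key names deduplicates only those keys, while every other column name shared by both DataFrames appears twice in the result.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Two tiny DataFrames that share the key "Store" and a non-key column "Dept",
// mimicking train_df and features_df being loaded with the same 14-column schema
val left = Seq((1, 10, "a")).toDF("Store", "Dept", "OnlyLeft")
val right = Seq((1, 20, "b")).toDF("Store", "Dept", "OnlyRight")

// Joining on Seq("Store") keeps a single "Store" column,
// but "Dept" exists on both sides and therefore shows up twice
val joined = left.join(right, Seq("Store"), "left")
joined.printSchema() // Store, Dept, OnlyLeft, Dept, OnlyRight
println(joined.columns.length) // 5, not 4
If the duplicated names are a problem downstream, selecting only the columns you actually need from one side before the join (or loading each file with a schema that matches its own columns) avoids them; the join itself is behaving as documented.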