How to compare two schema in Databricks notebook in python
I'm going to ingest data using a Databricks notebook, and I want to validate the schema of the ingested data against the schema I expect that data to have.
Basically I have:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

validation_schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", IntegerType(), False),
    StructField("c", StringType(), False),
    StructField("d", StringType(), False)
])

data_ingested_good = [("foo", 1, "blabla", "36636"),
                      ("foo", 2, "booboo", "40288"),
                      ("bar", 3, "fafa", "42114"),
                      ("bar", 4, "jojo", "39192"),
                      ("baz", 5, "jiji", "32432")
                      ]

data_ingested_bad = [("foo", "1", "blabla", "36636"),
                     ("foo", "2", "booboo", "40288"),
                     ("bar", "3", "fafa", "42114"),
                     ("bar", "4", "jojo", "39192"),
                     ("baz", "5", "jiji", "32432")
                     ]

# The raw Python lists have no schema of their own, so build DataFrames from
# them first; their inferred schemas can then be inspected and compared.
df_good = spark.createDataFrame(data_ingested_good, ["a", "b", "c", "d"])
df_bad = spark.createDataFrame(data_ingested_bad, ["a", "b", "c", "d"])

df_good.printSchema()
df_bad.printSchema()
print(validation_schema.simpleString())
I've seen similar questions, but the answers are always in Scala.
It really depends on the exact requirements and on how complex the schemas you want to compare are: for example, whether the nullability flag is ignored or taken into account, whether column order matters, whether maps/structs/arrays need to be supported, and so on. It also depends on whether you want to see the actual differences when the schemas don't match, or just get a flag.
In the simplest case it can be as trivial as comparing the string representations of the two schemas:
def compare_schemas(df1, df2):
    return df1.schema.simpleString() == df2.schema.simpleString()
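The same idea also works when the expected schema is a StructType literal rather than another DataFrame's schema, since a StructType exposes simpleString() as well. A minimal sketch (the matches_expected_schema name and the strict flag are mine, not part of the answer):

def matches_expected_schema(df, expected_schema, strict=False):
    """Compare a DataFrame's schema against an expected StructType.

    strict=False compares only column names, order and data types
    (simpleString() omits nullability); strict=True uses StructType
    equality, which also takes nullability into account.
    """
    if strict:
        return df.schema == expected_schema
    return df.schema.simpleString() == expected_schema.simpleString()

matches_expected_schema(df_bad, validation_schema)  # False: column "b" arrives as a string

Keep in mind that schemas inferred by createDataFrame can differ from your expectation in subtle ways, e.g. Python ints are inferred as bigint rather than int, so even the "good" data may fail an exact comparison.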
Personally, I would recommend using an existing library such as Chispa, which has more advanced schema-comparison functionality: you can tune the checks, it shows the differences, and so on. After installing it (you can simply run %pip install chispa), the following raises an exception if the schemas differ:
from chispa.schema_comparer import assert_schema_equality
assert_schema_equality(df1.schema, df2.schema)
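If you would rather get a boolean flag than an exception, you can wrap the assertion in a try/except. A small sketch (the check_ingested_schema name is made up here, and the exact exception type chispa raises may vary, hence the broad except):

from chispa.schema_comparer import assert_schema_equality

def check_ingested_schema(df, expected_schema):
    """Return True if the DataFrame's schema equals expected_schema, else False."""
    try:
        assert_schema_equality(df.schema, expected_schema)
        return True
    except Exception:
        # chispa's error message contains a readable diff of the two schemas;
        # here we only report pass/fail
        return False

check_ingested_schema(df_bad, validation_schema)  # False for the question's bad data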
Another approach: you can find the differences with a simple Python list comparison.
dept = [("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
]
deptColumns = ["dept_name","dept_id"]
dept1 = [("Finance",10,'999'),
("Marketing",20,'999'),
("Sales",30,'999'),
("IT",40,'999')
]
deptColumns1 = ["dept_name","dept_id","extracol"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
dept1DF = spark.createDataFrame(data=dept1, schema = deptColumns1)
deptDF_columns=deptDF.schema.names
dept1DF_columns=dept1DF.schema.names
list_difference = []
for item in dept1DF_columns:
if item not in deptDF_columns:
list_difference.append(item)
print(list_difference)
Printed output:
['extracol']
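Note that comparing only the column names catches missing or extra columns but not type mismatches (such as column "b" arriving as a string instead of an int in the question). A hedged extension of the same idea that also compares data types, using the (name, type) pairs from df.dtypes (the diff_schema helper name is made up here):

def diff_schema(df_actual, expected_schema):
    """Return columns that are missing, extra, or of the wrong type."""
    actual = dict(df_actual.dtypes)
    expected = {f.name: f.dataType.simpleString() for f in expected_schema.fields}

    return {
        "missing": sorted(set(expected) - set(actual)),
        "extra": sorted(set(actual) - set(expected)),
        "wrong_type": {n: f"{actual[n]} != {expected[n]}"
                       for n in set(actual) & set(expected)
                       if actual[n] != expected[n]},
    }

print(diff_schema(df_bad, validation_schema))
# e.g. {'missing': [], 'extra': [], 'wrong_type': {'b': 'string != int'}}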