Great Expectations custom expectation not ignoring nulls as requested
The library versions we are using:
snowconn==3.7.1
snowflake-connector-python==2.3.10
snowflake-sqlalchemy==1.2.3
SQLAlchemy==1.3.23
great_expectations==0.13.10
pandas==1.1.5
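Note that we fetch the data from Snowflake ourselves and then feed the resulting DataFrame into Great Expectations. I know GE has a Snowflake datasource, and adding it is on my list, but I think this setup should work even without that datasource.
We have the following Great Expectations data context configuration: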
data_context_config = DataContextConfig(
    datasources={
        datasource_name: DatasourceConfig(
            class_name='PandasDatasource',
            data_asset_type={
                'module_name': 'dataqa.dataset',
                'class_name': 'CustomPandasDataset'
            }
        )
    },
    store_backend_defaults=S3StoreBackendDefaults(
        default_bucket_name=METADATA_BUCKET,
        expectations_store_prefix=EXPECTATIONS_PATH,
        validations_store_prefix=VALIDATIONS_PATH,
        data_docs_prefix=DATA_DOCS_PATH,
    ),
    validation_operators={
        "action_list_operator": {
            "class_name": "ActionListValidationOperator",
            "action_list": [
                {
                    "name": "store_validation_result",
                    "action": {"class_name": "StoreValidationResultAction"},
                },
                {
                    "name": "store_evaluation_params",
                    "action": {"class_name": "StoreEvaluationParametersAction"},
                },
                {
                    "name": "update_data_docs",
                    "action": {"class_name": "UpdateDataDocsAction"},
                },
            ],
        }
    }
)

ge_context = BaseDataContext(project_config=data_context_config)
CustomPandasDataset is defined as:
class CustomPandasDataset(PandasDataset):

    _data_asset_type = "CustomPandasDataset"

    @MetaPandasDataset.multicolumn_map_expectation
    def expect_column_A_equals_column_B_column_C_ratio(
        self,
        column_list,
        ignore_row_if='any_value_is_missing'
    ):
        column_a = column_list.iloc[:,0]
        column_b = column_list.iloc[:,1]
        column_c = column_list.iloc[:,2]

        return abs(column_a - (1.0 - (column_b/column_c))) <= 0.001
and it is called like this:
cols = ['a', 'b', 'c']
batch.expect_column_A_equals_column_B_column_C_ratio(
    cols,
    catch_exceptions=True
)
Later we validate the data context like this:
return ge_context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=batches,
    run_id=run_id)["success"]
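Often, the a and b columns are null in our data. Since I have set the ignore_row_if='any_value_is_missing' flag on the custom expectation, I expect rows with a null value in any of columns a, b, or c to be skipped. But Great Expectations is not skipping them; instead it adds them to the unexpected (or "failed") field of the output: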
result
    element_count                    1000
    missing_count                    0
    missing_percent                  0
    unexpected_count                 849
    unexpected_percent               84.89999999999999
    unexpected_percent_total         84.89999999999999
    unexpected_percent_nonmissing    84.89999999999999
    partial_unexpected_list
        0
            a    null
            b    null
            c    1.63
I'm not sure why this is happening. In the Great Expectations source, multicolumn_map_expectation does:
...
elif ignore_row_if == "any_value_is_missing":
    boolean_mapped_skip_values = test_df.isnull().any(axis=1)
...
boolean_mapped_success_values = func(
    self, test_df[boolean_mapped_skip_values == False], *args, **kwargs
)
success_count = boolean_mapped_success_values.sum()
nonnull_count = (~boolean_mapped_skip_values).sum()
element_count = len(test_df)

unexpected_list = test_df[
    (boolean_mapped_skip_values == False)
    & (boolean_mapped_success_values == False)
]
unexpected_index_list = list(unexpected_list.index)

success, percent_success = self._calc_map_expectation_success(
    success_count, nonnull_count, mostly
)
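I read this as ignoring rows that contain nulls: they should not be added to the unexpected list and should not be used to determine percent_success. I dropped a pdb breakpoint into our code and verified that the DataFrame we call the expectation on can be manipulated in the right way to get the "sane" data (test_df.isnull().any(axis=1)), but for some reason Great Expectations lets those nulls slip through. Does anyone know why?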
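The masking logic in the source excerpt can be reproduced on a toy frame. What follows is a minimal sketch, assuming only pandas; ratio_check and count_unexpected are made-up helper names, not part of Great Expectations. Row 0 mirrors the reported unexpected row (a and b null, c = 1.63). It is skipped under the "any" mask, but under an "all" mask (the other ignore_row_if mode, "all_values_are_missing") it is evaluated and fails, because a comparison against NaN evaluates to False.

```python
import pandas as pd

NAN = float("nan")

# Toy data: row 0 mirrors the reported row, row 1 satisfies the ratio,
# row 2 is entirely null.
test_df = pd.DataFrame(
    {
        "a": [NAN, 0.5, NAN],
        "b": [NAN, 1.0, NAN],
        "c": [1.63, 2.0, NAN],
    }
)


def ratio_check(df):
    """Same formula as the custom expectation; NaN inputs compare as False."""
    a, b, c = df.iloc[:, 0], df.iloc[:, 1], df.iloc[:, 2]
    return abs(a - (1.0 - (b / c))) <= 0.001


def count_unexpected(df, skip_mask):
    """Evaluate only the non-skipped rows and count the failures."""
    kept = df[~skip_mask]
    success = ratio_check(kept)
    return int((~success).sum())


skip_if_any = test_df.isnull().any(axis=1)  # "any_value_is_missing"
skip_if_all = test_df.isnull().all(axis=1)  # "all_values_are_missing"

unexpected_any = count_unexpected(test_df, skip_if_any)
unexpected_all = count_unexpected(test_df, skip_if_all)
print(unexpected_any, unexpected_all)
```

In this sketch the partially null row is only counted as unexpected when the "all" mask is in effect, which matches the shape of the output above (missing_count 0 and a partially null row in partial_unexpected_list). That is a hint about where to look, not a confirmed diagnosis.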
I believe the poster filed a GitHub issue here: https://github.com/great-expectations/great_expectations/issues/2460. Progress can be tracked there.
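One thing that may be worth checking in the meantime is whether the wrapper generated by multicolumn_map_expectation declares its own ignore_row_if default. This is an assumption, not something confirmed by the issue or verified against 0.13.10. If it does, the default written on the custom method would never be read. The sketch below is plain Python with a hypothetical decorator (multicolumn_like) and class (Demo); it does not use any Great Expectations code and only illustrates the general mechanism.

```python
import functools


def multicolumn_like(func):
    """Hypothetical decorator whose wrapper declares its own default."""

    @functools.wraps(func)
    def inner_wrapper(
        self, column_list, ignore_row_if="all_values_are_missing", **kwargs
    ):
        # The wrapper picks the skip mode from ITS OWN argument; the default
        # written on `func` is never consulted here.
        self.mode_used = ignore_row_if
        return func(self, column_list, **kwargs)

    return inner_wrapper


class Demo:
    mode_used = None

    @multicolumn_like
    def expect_something(self, column_list, ignore_row_if="any_value_is_missing"):
        return True


d = Demo()

# Relying on the default written on the method: the wrapper's default wins.
d.expect_something(["a", "b", "c"])
mode_default = d.mode_used

# Passing the mode explicitly at the call site reaches the wrapper.
d.expect_something(["a", "b", "c"], ignore_row_if="any_value_is_missing")
mode_explicit = d.mode_used

print(mode_default, mode_explicit)
```

If this is what happens in the real decorator, passing ignore_row_if='any_value_is_missing' explicitly in the call to expect_column_A_equals_column_B_column_C_ratio would be a cheap experiment to confirm it.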