Use Great Expectations to validate a pandas DataFrame with an existing suite JSON

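I am using the Great Expectations Python package (version 0.14.10) to validate some data. I have followed the provided tutorials and created a great_expectations.yml in the local ./great_expectations folder. I have also created a Great Expectations suite based on a .csv version of the data (I named this file ge_suite.json).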

GOAL: I want to use the ge_suite.json file to validate an in-memory pandas DataFrame.

I have tried following along with code like this:

import great_expectations as ge
import pandas as pd
from ruamel import yaml
from great_expectations.data_context import DataContext

context = DataContext()
df = pd.read_pickle('/path/to/my/df.pkl')
batch_kwargs = {"datasource": "my_datasource_name", "dataset": df}
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="ge_suite")

The datasources section of my great_expectations.yml file looks like this:

datasources:
  my_datasource_name:
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      default_inferred_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        base_directory: /tmp
        class_name: InferredAssetFilesystemDataConnector
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector

When I run the batch = context.get_batch(... command in Python, I get the following error:

File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1655, in get_batch
  return self._get_batch_v2(
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1351, in _get_batch_v2
  batch = datasource.get_batch(
AttributeError: 'Datasource' object has no attribute 'get_batch'

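I assume I need to add something to the datasource definition in my great_expectations.yml file to fix this. Or could this be a versioning issue? I'm not sure. I looked through the online documentation for a while and did not find an answer. How do I achieve the "GOAL" (defined above) and get past this error?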

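The error most likely occurs because batch_kwargs belongs to the older (V2) batch API, while the Datasource and RuntimeDataConnector classes in your great_expectations.yml belong to the newer (V3) Batch Request API, so the datasource object has no get_batch method. If you want to validate an in-memory pandas DataFrame, you can refer to the following two pages for information on how to do that: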

https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/pandas/

https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/how_to_create_a_batch_of_data_from_an_in_memory_spark_or_pandas_dataframe/

To give a concrete example in code, though, you can do something like this:

import great_expectations as ge
import os
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()
df = pd.read_pickle('/path/to/my/df.pkl')

suite_name = 'ge_suite'
data_asset_name = 'your_data_asset_name'
batch_id = 'your_batch_id'

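# datasource_name, data_connector_name and the batch identifier key must match
# the names defined in great_expectations.yml.
batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name=data_asset_name,
    runtime_parameters={"batch_data": df},  # the in-memory DataFrame
    batch_identifiers={"default_identifier_name": batch_id},
)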

# context.run_checkpoint looks for a checkpoint file on disk, so create one first.
checkpoint_name = 'your_checkpoint_name'
checkpoint_path = os.path.abspath(f'./great_expectations/checkpoints/{checkpoint_name}.yml')
checkpoint_yml = f'''
name: {checkpoint_name}
config_version: 1
class_name: SimpleCheckpoint
expectation_suite_name: {suite_name}
'''
# Make sure the checkpoints folder exists before writing the file.
os.makedirs(os.path.dirname(checkpoint_path), exist_ok=True)
with open(checkpoint_path, 'w') as f:
    f.write(checkpoint_yml)

result = context.run_checkpoint(
    checkpoint_name=checkpoint_name,
    validations=[{"batch_request": batch_request, 'expectation_suite_name': suite_name}, ],
)
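
If writing a checkpoint file to disk is more than you need, a lighter-weight option may be to get a Validator for the same batch request and call validate() on it. Below is a minimal sketch, assuming the context and batch_request objects from the example above. The helper names validate_batch and failed_expectation_types are hypothetical, and the result layout they rely on (results, success, expectation_config.expectation_type) is what I would expect in 0.14.x; it may differ in other versions.

```python
def validate_batch(context, batch_request, suite_name):
    """Validate one batch against an existing suite without a checkpoint file.

    `context` is expected to be a Great Expectations DataContext and
    `batch_request` a RuntimeBatchRequest wrapping the in-memory DataFrame.
    """
    # get_validator pairs the requested batch with the named expectation suite.
    validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite_name=suite_name,
    )
    # validate() runs every expectation in the suite against the batch.
    return validator.validate()


def failed_expectation_types(validation_result):
    """Return the types of the expectations that did not pass (hypothetical helper)."""
    return [
        r.expectation_config.expectation_type
        for r in validation_result.results
        if not r.success
    ]
```

With the variables defined earlier, this would be called as validation_result = validate_batch(context, batch_request, suite_name). After that, validation_result.success should give the overall outcome, and failed_expectation_types(validation_result) lists which expectations failed.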