Use Great Expectations to validate a pandas DataFrame with an existing suite JSON
I'm using the Great Expectations Python package (version 0.14.10) to validate some data. I followed the provided tutorial and created a great_expectations.yml in the local ./great_expectations folder. I also created a Great Expectations suite based on a .csv version of the data (this file is named ge_suite.json).
Goal: I want to use the ge_suite.json file to validate an in-memory pandas DataFrame.
I tried to follow along with code that looks like this:
import great_expectations as ge
import pandas as pd
from ruamel import yaml
from great_expectations.data_context import DataContext
context = DataContext()
df = pd.read_pickle('/path/to/my/df.pkl')
batch_kwargs = {"datasource": "my_datasource_name", "dataset": df}
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="ge_suite")
The datasources section of my great_expectations.yml file looks like this:
datasources:
  my_datasource_name:
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      default_inferred_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        base_directory: /tmp
        class_name: InferredAssetFilesystemDataConnector
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
      default_runtime_data_connector_name:
        batch_identifiers:
          - default_identifier_name
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector
When I run the batch = context.get_batch(... command in Python, I get the following error:
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1655, in get_batch
return self._get_batch_v2(
File "/Users/username/opt/miniconda3/envs/myenv/lib/python3.8/site-packages/great_expectations/data_context/data_context.py", line 1351, in _get_batch_v2
batch = datasource.get_batch(
AttributeError: 'Datasource' object has no attribute 'get_batch'
I assume I need to add something to the datasource definition in the great_expectations.yml file to fix this. Or could it be a versioning issue? I'm not sure. I've looked through the online documentation for a while and haven't found an answer. How do I achieve the goal defined above and get past this error?
If you want to validate an in-memory pandas DataFrame, you can refer to the following documentation for information on how to do this:
https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/pandas/
To give a concrete example in code, though, you can do something like this:
import great_expectations as ge
import os
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest

# Load the Data Context from the local ./great_expectations folder
context = ge.get_context()

df = pd.read_pickle('/path/to/my/df.pkl')
suite_name = 'ge_suite'
data_asset_name = 'your_data_asset_name'
batch_id = 'your_batch_id'

# Hand the in-memory DataFrame to the RuntimeDataConnector defined in great_expectations.yml
batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource_name",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name=data_asset_name,
    runtime_parameters={"batch_data": df},
    batch_identifiers={"default_identifier_name": batch_id},
)

# The context.run_checkpoint method looks for a checkpoint file on disk. Create one...
checkpoint_name = 'your_checkpoint_name'
checkpoint_path = os.path.abspath(f'./great_expectations/checkpoints/{checkpoint_name}.yml')
checkpoint_yml = f'''
name: {checkpoint_name}
config_version: 1
class_name: SimpleCheckpoint
expectation_suite_name: {suite_name}
'''
with open(checkpoint_path, 'w') as f:
    f.write(checkpoint_yml)

# Run the checkpoint, validating the runtime batch against the existing ge_suite expectations
result = context.run_checkpoint(
    checkpoint_name=checkpoint_name,
    validations=[{"batch_request": batch_request, "expectation_suite_name": suite_name}],
)
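To check whether the DataFrame actually passed the suite, you can inspect the object returned by run_checkpoint. This is a minimal sketch assuming the CheckpointResult API of GE 0.14.x (success, list_validation_results()); adjust the names if your version differs:
# Minimal sketch, assuming the GE 0.14.x CheckpointResult API
print(result.success)  # True only if every expectation in ge_suite passed

for validation_result in result.list_validation_results():
    # Per-suite statistics, e.g. evaluated_expectations / unsuccessful_expectations
    print(validation_result.statistics)

# SimpleCheckpoint updates Data Docs by default; open the rendered report in a browser
context.open_data_docs()
If you would rather not write a checkpoint YAML to disk at all, calling context.get_validator(batch_request=batch_request, expectation_suite_name=suite_name) and then validator.validate() is another route shown in the linked docs for the V3 API, but the checkpoint approach above keeps the results in Data Docs.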