How to TDD with pandas and pytest?

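I have a Python script that consolidates reports, using Pandas throughout in a sequence of DataFrame operations (drop, groupby, sum, etc.). Let's say I start with a simple function that removes all columns that have no values; it takes a DataFrame as input and returns a DataFrame: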

# cei.py
import pandas as pd


def clean_table_cols(source_df: pd.DataFrame) -> pd.DataFrame:
    # IMPLEMENTATION, e.g. drop every column whose values are all missing
    return source_df.dropna(axis="columns", how="all")

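In my test I want to verify that this function actually removes all columns whose values are all empty. So I arranged a test input and an expected output, and I compare them with the assert_frame_equal function from pandas.testing:
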
# test_cei.py
import pandas as pd

import cei


def test_clean_table_cols() -> None:
    df = pd.DataFrame(
        {
            "full_valued": [1, 2, 3],
            "all_missing1": [None, None, None],
            "some_missing": [None, 2, 3],
            "all_missing2": [None, None, None],
        }
    )
    expected = pd.DataFrame({"full_valued": [1, 2, 3], "some_missing": [None, 2, 3]})
    result = cei.clean_table_cols(df)
    pd.testing.assert_frame_equal(result, expected)

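My question is whether this is conceptually a unit test or an e2e/integration test, since I am not mocking the pandas implementation. But if I mock the DataFrame, I won't be testing what the code actually does. What is the recommended way to test this while following TDD best practices?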

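Note: using Pandas in this project is a design decision, so we have no intention of abstracting the Pandas interface in order to replace it with another library in the future.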

Yes, this code is effectively an integration test, and that is not necessarily a bad thing.
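If you stay with tests that run against real pandas, a few extra edge cases are cheap to add. The following is only a sketch: the implementation of clean_table_cols is assumed from the hint in the question, and the test names are made up for illustration.

```python
import pandas as pd


def clean_table_cols(source_df: pd.DataFrame) -> pd.DataFrame:
    # Assumed implementation, taken from the hint in the question.
    return source_df.dropna(axis="columns", how="all")


def test_keeps_frame_without_empty_columns() -> None:
    # No column is entirely missing, so nothing should be dropped.
    df = pd.DataFrame({"a": [1, 2], "b": [None, 3.0]})
    pd.testing.assert_frame_equal(clean_table_cols(df), df)


def test_drops_every_column_when_all_are_empty() -> None:
    df = pd.DataFrame({"a": [None, None], "b": [float("nan"), float("nan")]})
    result = clean_table_cols(df)
    assert list(result.columns) == []
    assert len(result) == 2  # the rows (index) are kept


def test_does_not_mutate_input() -> None:
    # dropna returns a new frame by default; the input must stay unchanged.
    df = pd.DataFrame({"a": [1, 2], "b": [None, None]})
    before = df.copy()
    clean_table_cols(df)
    pd.testing.assert_frame_equal(df, before)
```

pytest collects these functions by their test_ prefix, so no extra wiring is needed.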

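Even if using pandas is a fixed design decision, there are still plenty of good reasons to abstract away from external libraries, and testing is one of them. Abstracting away from an external library lets you test your business logic independently of that library. In this case, abstracting away from pandas would turn the code above into a unit test: it would test the interaction with the library.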

To apply this pattern, I suggest you look at the ports and adapters architecture pattern.
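As a rough sketch of what that could look like here: the business logic depends on a small interface (the port), pandas lives behind one implementation of it (the adapter), and the unit test swaps in a fake. All names below (Table, TableCleaner, PandasTableCleaner, report_columns, FakeCleaner) are hypothetical and chosen only for illustration; they are not part of pandas or of the question's code.

```python
from typing import Optional, Protocol

import pandas as pd

# Hypothetical plain-Python table type: column name -> list of values.
Table = dict[str, list[Optional[float]]]


class TableCleaner(Protocol):
    """Port: what the business logic needs, with no mention of pandas."""

    def drop_empty_columns(self, table: Table) -> Table: ...


class PandasTableCleaner:
    """Adapter: implements the port using pandas."""

    def drop_empty_columns(self, table: Table) -> Table:
        df = pd.DataFrame(table).dropna(axis="columns", how="all")
        # Convert back to plain Python, mapping NaN to None.
        return {
            name: [None if pd.isna(v) else v for v in df[name].tolist()]
            for name in df.columns
        }


def report_columns(table: Table, cleaner: TableCleaner) -> list[str]:
    """Business logic (illustrative): sorted names of the columns kept in the report."""
    return sorted(cleaner.drop_empty_columns(table))


class FakeCleaner:
    """Test double that records calls and returns a canned result."""

    def __init__(self, result: Table) -> None:
        self.result = result
        self.calls: list[Table] = []

    def drop_empty_columns(self, table: Table) -> Table:
        self.calls.append(table)
        return self.result


def test_report_columns_unit() -> None:
    # Unit test: pandas is not involved, only the interaction with the port.
    fake = FakeCleaner({"b": [1.0], "a": [2.0]})
    assert report_columns({"x": [None]}, fake) == ["a", "b"]
    assert fake.calls == [{"x": [None]}]


def test_pandas_adapter_integration() -> None:
    # Integration test: exercises the real pandas behaviour behind the adapter.
    table: Table = {"full": [1.0, 2.0], "empty": [None, None]}
    assert PandasTableCleaner().drop_empty_columns(table) == {"full": [1.0, 2.0]}
```

The trade-off is visible in the two tests: the unit test is fast and independent of pandas but says nothing about dropna, while the adapter test is the only place where pandas behaviour is actually checked.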

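However, this does mean you are no longer testing the functionality provided by pandas. If testing that is still your specific intent, then an integration test is a fine solution.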

You may find tdda (test-driven data analysis) useful. Quoting from the documentation:

The tdda package provides Python support for test-driven data analysis (see 1-page summary with references, or the blog). The tdda.referencetest library is used to support the creation of reference tests, based on either unittest or pytest. The tdda.constraints library is used to discover constraints from a (Pandas) DataFrame, write them out as JSON, and to verify that datasets meet the constraints in the constraints file. It also supports tables in a variety of relation databases. There is also a command-line utility for discovering and verifying constraints, and detecting failing records. The tdda.rexpy library is a tool for automatically inferring regular expressions from a column in a Pandas DataFrame or from a (Python) list of examples. There is also a command-line utility for Rexpy. Although the library is provided as a Python package, and can be called through its Python API, it also provides command-line tools.

See also Nick Radcliffe's PyData talk on Test-Driven Data Analysis.