Saving data with DataCatalog

Question

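I'm looking at the iris example project that Kedro provides. Besides logging the accuracy, I would also like to save predictions and test_y as a CSV.

This is the example node provided by Kedro.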

import logging

import numpy as np
import pandas as pd

def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)

I added the following to save the data.

data = pd.DataFrame({"target": target , "prediction": predictions})
data_set = CSVDataSet(filepath="data/test.csv")
data_set.save(data)

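This works as expected, but my question is: is this the Kedro way of doing things? Can I declare data_set in catalog.yml and save data to it later? If so, how do I access the data_set from catalog.yml inside a node?

Is there a way to save the data without creating the dataset inside the node, as in data_set = CSVDataSet(filepath="data/test.csv")? I would prefer to have it in catalog.yml, if that is possible and follows Kedro conventions!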

Answer 1

Kedro actually abstracts this part away for you. You don't need to access the datasets through their Python API.

Your report_accuracy method does need to be adjusted to return a DataFrame instead of None.
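A minimal sketch of that adjustment, assuming the two columns from the question's own snippet (target and prediction). The rest of the body is the example node unchanged; only the return type and the final line differ.

```python
import logging

import numpy as np
import pandas as pd


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> pd.DataFrame:
    """Log the accuracy and return targets and predictions as a DataFrame."""
    # Get true class index from the one-hot encoded labels
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
    # Return the data instead of saving it here; the node itself no longer
    # knows or cares where (or whether) the result is written.
    return pd.DataFrame({"target": target, "prediction": predictions})
```

The function no longer touches CSVDataSet at all, which also makes it easy to call directly in a unit test.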

Your node then needs to be defined like this:

node(
  func=report_accuracy,
  inputs='dataset_a',
  outputs='dataset_b'
)

Kedro will then look at your catalog and load/save dataset_a and dataset_b as needed:

dataset_a:
   type: pandas.CSVDataSet
   filepath: xxxx.csv

dataset_b:
   type: pandas.ParquetDataSet
   filepath: yyyy.pq

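When you run the node/pipeline, Kedro will handle the load/save operations for you. You also don't need to save every dataset: one that is only used partway through the pipeline can stay in memory. See MemoryDataSet in the Kedro documentation.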

python

kedro