Python Machine Learning/Data Science Project Structure

Question

I'm looking for information on how to organize a Python machine learning project. For general Python projects there is Cookiecutter, and for R there is ProjectTemplate.

This is my current folder structure, but I'm mixing Jupyter Notebooks with actual Python code, and the result isn't very clear.

.
├── cache
├── data
├── my_module
├── logs
├── notebooks
├── scripts
├── snippets
└── tools

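I work in the scripts folder and currently add all functions to files under my_module, but that leads to errors when loading data (relative vs. absolute paths) and to other problems.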

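I could not find proper best practices or good examples on this topic, apart from this Kaggle competition solution and some notebooks that condense all their functions at the top of the notebook.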

Answer 1

You may want to have a look at:

http://tshauck.github.io/Gloo/

Gloo's goal is to tie together a lot of the data analysis actions that happen regularly and make that process easy. Automatically loading data into the IPython environment, running scripts, making utility functions available and more. These are things that have to be done often, but aren't the fun part.

It is not actively maintained, but the basics are there.

Answer 2

We've started a cookiecutter-data-science project designed for Python data scientists that might be of interest to you; check it out here. The structure is explained here.

I'd love feedback if you have any! Feel free to respond here, open PRs, or file issues.

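In response to your question about reusing code by importing .py files into notebooks: the most effective approach our team has found is to append to the system path. This may make some people cringe, but it seems to be the cleanest way to import code into a notebook without a lot of module boilerplate and an editable install (pip install -e).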

One tip is to use the %autoreload and %aimport magics together with the above. Here's an example:

# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my method from the source code
%aimport preprocess.build_features

For some context, the code above comes from section 3.5 in this notebook.
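The example above builds the path from os.getcwd(), so it depends on the notebook being started exactly one level below the project root. A possible variation, sketched here in plain Python, walks upward from a starting directory until it finds a folder containing 'src' and appends that to sys.path. The helper names find_project_root and add_src_to_path are hypothetical and not part of cookiecutter-data-science; the 'src' marker is taken from the example above.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no parent of {start} contains a '{marker}' directory")


def add_src_to_path(start: Path, marker: str = "src") -> Path:
    """Append <project root>/<marker> to sys.path (once) and return the project root."""
    root = find_project_root(start, marker)
    src_dir = str(root / marker)
    if src_dir not in sys.path:  # avoid duplicate entries when a cell is re-run
        sys.path.append(src_dir)
    return root


# Typical use from a notebook cell (assumes a 'src' folder exists somewhere above):
# project_root = add_src_to_path(Path.cwd())
# data_dir = project_root / "data"  # build data paths from the root, not the cwd
```

Returning the project root also gives one place to anchor data paths, which may help with the relative/absolute path errors mentioned in the question.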
