Creating a data analysis pipeline using (GNU) make

Question

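I am a scientist analyzing brain data collected from multiple subjects. During the analysis, the data is processed in multiple steps, a bit like a cooking recipe. At the end of the line, there is a step that collects the processed data of all the individual subjects and creates summary statistics and so on.

Since a single step can take up to an hour to complete, I would like an automated way to run all the steps for all subjects and compute the summary statistics, without repeating steps that have already completed.

Make seems like a good utility for this, but I need some help with the structure of the Makefile. Here is a simplified example: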

# Keep intermediate files!
.SECONDARY:

# In this simplified example, there are 3 subjects, in reality there are more 
SUBJECTS = subject_a subject_b subject_c

# In this simplified example there are 3 data processing steps, each one taking
# one file as input and emitting one file as output. In reality, there are more
# steps and each step takes multiple input files and emits multiple output
# files.
step1_%.dat : step1.py input_%.dat
    touch step1_$*.dat

step2_%.dat : step2.py step1_%.dat
    touch step2_$*.dat

# Let's say this step produces many output files
STEP3_PROD = step3_%_1.dat step3_%_2.dat step3_%_3.dat
$(STEP3_PROD) : step3.py step2_%.dat
    touch $(subst %,$*,$(STEP3_PROD))

# Meta rule to perform the complete analysis for a single subject
.PHONY : $(SUBJECTS)
subject_% : step1_%.dat step2_%.dat $(STEP3_PROD)
    @echo 'Analysis complete for subject $*.'

# The summary depends on the analysis of all subjects being complete.
summary.dat : summary.py $(SUBJECTS)
    touch summary.dat
    @echo 'All analysis done!'

all : summary.dat

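The problem with the Makefile above is that the summary step (python summary.py) is always executed, even when nothing has changed. This is presumably because summary.dat depends on the phony subject_% targets, and phony targets are always considered out of date, so anything that depends on them is rebuilt every time.

Is there a way to structure this Makefile so that the summary step is not executed unnecessarily? Perhaps there is some way to expand $(STEP3_PROD) for all subjects?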

Answer 1

Do not over-complicate things, or it will backfire. Try something like this:

.SECONDARY:

all: summary.dat

SUBJECTS:=a b c
SUBJECT_RULES:=$(addprefix subject_, $(SUBJECTS))
.PHONY: $(SUBJECT_RULES)

subject_a: step3_a_1.dat
subject_b: step3_b_1.dat
subject_c: step3_c_1.dat

step1_%.dat: input_%.dat
    touch $@

step2_%.dat: step1_%.dat
    touch $@

step3_%_1.dat: step2_%.dat
    touch $@

STEP3_PRE:=$(addprefix step3_, $(SUBJECTS))
STEP3_1_OUT:=$(addsuffix _1.dat, $(STEP3_PRE))
STEP3_ALL_OUT:=$(STEP3_1_OUT) \
    $(addsuffix _2.dat, $(STEP3_PRE)) \
    $(addsuffix _3.dat, $(STEP3_PRE))

summary.dat: $(STEP3_1_OUT)
    @echo "summary: $(STEP3_ALL_OUT)"
    touch $@

I don't think there is any need to track step3_%_2.dat and so on, since they are rebuilt together with step3_%_1.dat anyway.
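If you do want make to track every step 3 output, a variant along the following lines might work. This is only a sketch under some assumptions: GNU make, subjects named a, b and c, input files called input_a.dat and so on, and the step1.py, step2.py, step3.py and summary.py scripts from the question present in the working directory. The touch commands are placeholders for the real processing commands, and recipe lines must start with a tab character.

```makefile
# Sketch only: recipe lines below are indented with a tab, as make requires.

# Keep intermediate files.
.SECONDARY:

SUBJECTS := a b c

# Expand the three step 3 outputs for every subject, giving
# step3_a_1.dat step3_a_2.dat step3_a_3.dat step3_b_1.dat and so on.
STEP3_ALL_OUT := $(foreach s,$(SUBJECTS),step3_$(s)_1.dat step3_$(s)_2.dat step3_$(s)_3.dat)

.PHONY: all
all: summary.dat

step1_%.dat: step1.py input_%.dat
	touch $@

step2_%.dat: step2.py step1_%.dat
	touch $@

# A pattern rule with several targets tells GNU make that a single run of
# the recipe produces all of them for the matched subject ($* is the stem).
step3_%_1.dat step3_%_2.dat step3_%_3.dat: step3.py step2_%.dat
	touch step3_$*_1.dat step3_$*_2.dat step3_$*_3.dat

# summary.dat depends on real files only, never on phony targets, so it is
# rebuilt only when one of those files is newer than summary.dat.
summary.dat: summary.py $(STEP3_ALL_OUT)
	touch $@
```

The key point is the same as above: summary.dat lists only real files as prerequisites. After a first complete run of make, a second run should find nothing to rebuild unless an input file or one of the scripts has changed.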
