How to make Snakemake recognize Globus remote files using Globus CLI?

Question

I work in a high-performance computing grid environment where large-scale data transfers are done via Globus. I would like to use Snakemake to pull data from a Globus path, process the data, and then push the processed data to a different Globus path. Globus has a command-line interface.

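Pulling the data is not a problem, because I simply create a rule that runs globus transfer to create the required local files. To push data back to Globus, however, I think I need a rule that can "see" that a file is missing at the remote location and then work backwards to determine what has to happen to create it.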

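I could create local "proxy" files that represent the remote files. For example, I could write rules that create 'processed_data_1234.tar.gz' output files in a directory. These files would be created only with touch (and would therefore be empty), and the same rule would run globus transfer to push the real files to the remote location. However, there is overhead in making sure the proxy files do not get out of sync with the files actually hosted on Globus.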
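A minimal Python sketch of the proxy-file idea described above, for illustration only. The helper name push_and_mark and its arguments are hypothetical, and src and dst are assumed to be 'ENDPOINT_ID:/path' strings supplied by the caller. As far as I understand, globus transfer only submits a transfer task, so the proxy file created here records that a transfer was requested, not that it finished.

```python
import subprocess
from pathlib import Path


def push_and_mark(src, dst, proxy, run=subprocess.run):
    """Submit a Globus transfer, then create an empty local proxy file.

    src and dst are assumed to be 'ENDPOINT_ID:/path' strings.
    run is injectable so the helper can be exercised without the Globus CLI.
    """
    # Raises CalledProcessError if the CLI exits with a non-zero status,
    # in which case no proxy file is created.
    run(["globus", "transfer", src, dst], check=True)

    # The empty proxy file is what a Snakemake rule would declare as output.
    proxy = Path(proxy)
    proxy.parent.mkdir(parents=True, exist_ok=True)
    proxy.touch()
    return proxy
```

A rule could call this helper from a run: block and list the proxy path as its output. The synchronization concern remains: if the remote file is later deleted, the local proxy still exists and Snakemake will consider the target up to date.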

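Is there a more elegant way to get something like the Remote File capability? Would it be difficult to add Globus CLI support to Snakemake? Thanks in advance for any suggestions!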

Answer 1

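Would it help to create a utility function that generates the list of all required files and compares it against the list of files available on Globus? Something like this (pseudocode):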

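def return_needed_files():
    # Files the workflow should produce: hard-coded or specified with some logic
    list_needed_files = []
    # Files already on Globus, as appropriate, e.g. parsed from globus ls output
    list_available = []
    # Keep only the files that are not on Globus yet
    return [i for i in list_needed_files if i not in list_available]

# Include all the needed files in the all rule.
# The function is called here because Snakemake passes a wildcards
# argument to any function given uncalled as an input.
rule all:
    input: return_needed_files()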
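
The two placeholder lists in the pseudocode above could be filled in roughly as follows. This is only a sketch under assumptions: the helper names list_remote and missing_files are made up for illustration, endpoint_path is assumed to be an 'ENDPOINT_ID:/path' string, and the default text output of globus ls is assumed to print one entry per line.

```python
import subprocess


def list_remote(endpoint_path, run=subprocess.run):
    """Return the set of entry names that globus ls reports for endpoint_path.

    Assumes one entry per line in the command's text output.
    run is injectable so the parsing can be checked without the Globus CLI.
    """
    result = run(
        ["globus", "ls", endpoint_path],
        capture_output=True,
        text=True,
        check=True,
    )
    # Drop blank lines and surrounding whitespace.
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def missing_files(needed, available):
    """Return the needed names that are absent from available, in order."""
    available = set(available)
    return [name for name in needed if name not in available]
```

In the Snakefile, return_needed_files could then return missing_files(list_needed_files, list_remote(...)). Note that this comparison runs once, when the workflow is parsed, so files that appear on or disappear from Globus afterwards are not noticed until the next invocation.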

hpc

globus-toolkit

snakemake