How to use pandas within snakemake pipelines

Question

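I want to improve the reproducibility of some Python code I wrote by turning it into a data pipeline. I am used to targets in R and would like to find an equivalent in Python. My impression is that snakemake comes quite close.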

I don't understand how to use pandas inside a snakemake rule to read the input, modify it, and then write the output.

Let's take the simplest pipeline I can think of: we take a csv and write a copy of it somewhere else.

The pipeline works fine when using a shell command:

rule trying_snakemake:
    input:
        path="untitled.txt"
    output:
        "test-snakemake.csv"
    run:
        shell("cp {input.path} {output}")

I would like to do the equivalent with pandas (using pandas here is of course unnecessary, but the point is to understand the logic):

rule trying_snakemake:
    input:
        path="untitled.txt"
    output:
        "test-snakemake.csv"
    run:
        import pandas as pd
        df = pd.read_csv({input.path})
        df.to_csv({output}, header=False)

snakemake -c1

Invalid file path or buffer object type: <class 'set'>
  File "/home/jovyan/work/label-openfood/Snakefile", line 19, in __rule_trying_snakemake
  File "/opt/conda/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 482, in _read
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py", line 222, in _open_handles
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/common.py", line 609, in get_handle
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/common.py", line 396, in _get_filepath_or_buffer
  File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Exiting because a job execution failed. Look above for error message

I think the error occurs at the read_csv step, but I don't understand what it means, even though I am used to working with pandas.

Answer 1

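You are very close: the curly braces are not needed inside the run directive. The body of run is plain Python, so {input.path} is a set literal containing the path, which is why pandas complains about <class 'set'>. Pass input.path and output.csv directly: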

rule trying_snakemake:
    input:
        path="untitled.txt"
    output:
        csv="test-snakemake.csv"
    run:
        import pandas as pd
        df = pd.read_csv(input.path)
        df.to_csv(output.csv, header=False)
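
To illustrate why the braces work in the shell command but not in the pandas call, here is a minimal sketch using only the standard library. It assumes that shell() fills in its {...} placeholders by string formatting, which Python's str.format approximates here. The names rule_input and rule_output are hypothetical stand-ins for the input and output objects that Snakemake provides inside a rule.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the objects Snakemake passes to a rule.
rule_input = SimpleNamespace(path="untitled.txt")
rule_output = "test-snakemake.csv"

# Inside a string, braces are placeholders that formatting fills in.
# str.format is used here as an approximation of what shell() does.
command = "cp {input.path} {output}".format(input=rule_input, output=rule_output)
print(command)

# In plain Python code, braces around a value build a set literal,
# so this is a one-element set, not a path string.
wrapped = {rule_input.path}
print(type(wrapped))

# Without the braces, the value is the plain string that read_csv expects.
print(type(rule_input.path))
```

The same rule of thumb applies to any Python call inside run: use braces only inside strings handed to shell(), and refer to input and output directly everywhere else.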
