Where is the output file of a Python subprocess in Hadoop streaming?

Question

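I am using Hadoop streaming to run a C++ executable (a bioinformatics program called blast) through a Python subprocess. When run from the command line, blast writes a result file. When it runs on Hadoop, however, I cannot find blast's output file. Where did the output file go?

My code (map.py) is as follows: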

import os
import sys
from os.path import join
from subprocess import Popen, PIPE

# path used on hadoop
tool = './blastx'
reference_path = 'Reference.fa'

# assumption: current_path is not defined in the original post;
# the task's working directory is used here
current_path = os.getcwd()

# input format example

# >LW1           (contig name)
# ATCGATCGATCG   (sequence)

# sample file: https://goo.gl/XTauAx

(name, seq) = (None, None)

for line in sys.stdin:

    # when the ">" sign is detected, assign the contig name
    if line[0] == '>':
        name = line.strip()[1:]

    # otherwise, assign the sequence
    else:
        seq = line.strip()

        if name and seq:

            # assign the path of the output file
            output_file = join(current_path, 'tmp_output', name)

            # blast command example (export the output file to a given path)
            command = 'echo -e \">%s\n%s\" | %s -db %s -out %s -evalue 1e-10 -num_threads 16' % (name, seq, tool, reference_path, output_file)

            # execute the command with python subprocess
            cmd = Popen(command, stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=True)

            # retrieve the standard output and standard error of the command
            cmd_out, cmd_err = cmd.communicate()

            print('%s\t%s' % (name, output_file))

The command that calls blast is:

command = 'echo -e \">%s\n%s\" | %s -db %s -out %s -evalue 1e-10 -num_threads 16' % (name, seq, tool, reference_path, output_file)

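Normally the output files end up at the path given by output_file, but I cannot find them on the local file system or on HDFS. It looks as if they are created in a temporary directory and disappear after execution. How can I get them back?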
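As a side note on the command above: it depends on the shell's `echo -e` and on quoting inside one long string. A minimal sketch of the same invocation without a shell is shown below. It assumes, as the piped version already does, that the tool reads its query from standard input. The helper names `fasta_record`, `blast_args` and `run_blast` are made up for this illustration and are not part of blast or Hadoop.

```python
from subprocess import Popen, PIPE


def fasta_record(name, seq):
    """Format one contig as a two-line FASTA record."""
    return '>%s\n%s\n' % (name, seq)


def blast_args(tool, db, out_path, evalue='1e-10', threads=16):
    """Build the argument list with the same flags as the original command."""
    return [tool, '-db', db, '-out', out_path,
            '-evalue', str(evalue), '-num_threads', str(threads)]


def run_blast(name, seq, tool, db, out_path):
    """Run the tool with the FASTA record on stdin.

    Returns (exit status, captured stderr) so that failures become visible.
    """
    proc = Popen(blast_args(tool, db, out_path),
                 stdin=PIPE, stdout=PIPE, stderr=PIPE)
    _, err = proc.communicate(fasta_record(name, seq).encode('ascii'))
    return proc.returncode, err
```

In map.py this could replace the Popen call, for example `status, err = run_blast(name, seq, tool, reference_path, output_file)`. A non-zero status with a message in `err` would indicate that blast itself failed rather than that the file was lost.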

Answer 1

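I found blast's output files. It seems they stay on the node where blast was executed. After putting them back into HDFS, I could access them under the directory /user/yarn. What I did was add the following code to map.py: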

command = 'hadoop fs -put %s' % output_file
cmd = Popen(command, stdin=PIPE, stdout=PIPE, shell=True)

I also changed the output path to

output_file = name

instead of using

output_file = join(current_path, 'tmp_output', name)
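To confirm where a file actually lands, the mapper can report its host name and the absolute path of the output file. The sketch below writes that report to standard error, because in Hadoop streaming the mapper's standard output is the job output while standard error goes to the task logs. The function names `locate` and `report` are hypothetical helpers for this example.

```python
import os
import socket
import sys


def locate(path):
    """Return (host name, absolute path) for a file written by this task."""
    return socket.gethostname(), os.path.abspath(path)


def report(path, stream=sys.stderr):
    """Write a one-line diagnostic about the file to the given stream."""
    host, full = locate(path)
    stream.write('blast output: host=%s path=%s exists=%s\n'
                 % (host, full, os.path.exists(full)))
```

Calling `report(output_file)` right after blast finishes should show, in the task's stderr log, which node and which directory hold the file.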

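[Update 3/3] Putting the files under the yarn user's directory is not a good idea, though, because ordinary users do not have permission to edit files there. I suggest putting the files in /tmp/blast_tmp instead, by changing the command to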

command = 'hadoop fs -put %s /tmp/blast_tmp' % output_file

Before that, the directory /tmp/blast_tmp should be created with

% hadoop fs -mkdir /tmp/blast_tmp

The directory's permissions should then be changed with

% hadoop fs -chmod 777 /tmp/blast_tmp

This way, both the yarn user and you can access the directory.
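One caveat about the upload step: the Popen call shown earlier does not wait for `hadoop fs -put` to finish and does not check whether it succeeded. A minimal sketch that waits and checks the exit status might look like the following. The names `put_command` and `upload` are made up for this example, and the `call` parameter exists only so the function can be exercised without a cluster.

```python
import subprocess


def put_command(local_path, hdfs_dir='/tmp/blast_tmp'):
    """Build the argument list for copying one local file into HDFS."""
    return ['hadoop', 'fs', '-put', local_path, hdfs_dir]


def upload(local_path, hdfs_dir='/tmp/blast_tmp', call=subprocess.call):
    """Run the upload, wait for it to finish, and return True on exit status 0."""
    return call(put_command(local_path, hdfs_dir)) == 0
```

In map.py, `upload(output_file)` would go after blast has finished. If it returns False, writing a message to standard error keeps the failure visible in the task logs.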
