Limit forked samtools processes with GNU parallel within for and while loops including awk
I am trying to limit a parallelized script. The purpose of the script is to read a list within each of up to 10 samples/folders and use the records of that list to execute a samtools command, which is the most resource-demanding part.
Here is the simplified version:
for (10 items)
do
    while read (list 5000 items)
    do
        command 1
        command 2
        command 3
        ...
        samtools view -L input1 input2 | many_pipes_including_'awk' > output_file &
        ### TODO (WARNING): currently all processes are forked at the same time. This needs to be resolved; limit to a certain number of processes.
    done
done
To make use of our local server, the script forks each samtools pipeline into the background (the trailing &), which works. But it forks until all resources of the server are used and nobody else can work on it. Hence I would like to implement something like parallel -j 50 with GNU parallel. I tried it in front of the samtools command to be forked, e.g.

parallel -j 50 -k samtools view -L input1 input2 | many_pipes_including_'awk' > output_file &

which did not work (I also tried backticks), and I got

[main_samview] region "item_from_list" specifies an unknown reference name. Continue anyway.

or even vim was invoked. But I am also not sure whether this is the right place in the script for the parallel command. Do you have an idea how to resolve this, so that the number of forked processes is limited?
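For reference, the usual GNU parallel pattern is to hand the whole pipeline to parallel as a single quoted command, so that the pipe and the redirection run inside each job; the unknown-reference warning suggests parallel was instead appending the list entries as region arguments to samtools view. A minimal sketch, assuming a hypothetical file beds.txt that lists one BED file per line:

# {} is replaced by one line of beds.txt per job; at most 50 jobs run
# at a time, and -k keeps the outputs in input order.
parallel -j 50 -k "samtools view -L {} another_input_file.bam | awk '...' > {}.out" :::: beds.txt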
I also thought about implementing what is mentioned in https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop/103921 with the FIFO-based semaphore, but I hoped GNU parallel could do what I am looking for. I checked more pages, such as https://zvfak.blogspot.de/2012/02/samtools-in-parallel.html and https://davetang.org/muse/2013/11/18/using-gnu-parallel/, but they usually do not cover this combination of issues.
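GNU parallel can, for what it is worth, stand in for the FIFO-based semaphore directly: its semaphore mode sem (short for parallel --semaphore) keeps the loop structure unchanged and only replaces the bare & fork. A minimal sketch under that assumption, with the pipeline again quoted as one command:

while read PRODUCT_NAME_NO_SPACES
do
    # sem starts the job in the background but blocks here
    # whenever 50 jobs are already running.
    sem -j 50 "samtools view -L input1 input2 | awk '...' > output_file"
done < list_with_products.txt
# Block until all queued jobs have finished.
sem --wait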
Here is the more detailed version of the script, in case any of the commands in it are relevant (I heard that awk, backticks and newlines can be a problem in general?):
cd path_to_data
for SAMPLE_FOLDER in *
do
    cd ${SAMPLE_FOLDER}/another_folder
    echo "$SAMPLE_FOLDER was found"
    cat list_with_products.txt | while read PRODUCT_NAME_NO_SPACES
    do
        PRODUCT_NAME=`echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " "`
        echo "$PRODUCT_NAME with white spaces"
        BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
        grep "$PRODUCT_NAME" file_to_search_through > ${TMP_DIR}/tmp.gff
        cat ${TMP_DIR}/tmp.gff | some 'awk' command > ${BED_FILENAME}
        samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes | with | 'awk' | and | perl | etc > resultfolder/resultfile &
        ### TODO (WARNING): currently all processes are forked at the same time. This needs to be resolved; limit to a certain number of processes.
        rm ${TMP_DIR}/tmp.gff
    done
    cd (back_to_start)
done
rmdir -p ${OUTPUT_DIR}/tmp
First make a function that takes a single sample + a single product as input:
cd path_to_data
# Process a single sample and a single product name
doit() {
    SAMPLE_FOLDER="$1"
    PRODUCT_NAME_NO_SPACES="$2"
    SEQ="$3"
    # Make sure temporary files are named uniquely
    # so parallel jobs will not overwrite each other's files.
    GFF=${TMP_DIR}/tmp.gff-$SEQ
    cd ${SAMPLE_FOLDER}/another_folder
    echo "$SAMPLE_FOLDER was found"
    PRODUCT_NAME=`echo ${PRODUCT_NAME_NO_SPACES} | tr "@" " "`
    echo "$PRODUCT_NAME with white spaces"
    BED_FILENAME=${BED_DIR}/intersect_${PRODUCT_NAME_NO_SPACES}_${SAMPLE_FOLDER}.bed
    grep "$PRODUCT_NAME" file_to_search_through > $GFF
    cat $GFF | some 'awk' command > ${BED_FILENAME}
    samtools view -L ${BED_FILENAME} another_input_file.bam | many | pipes | with | 'awk' | and | perl | etc
    rm $GFF
    rmdir -p ${OUTPUT_DIR}/tmp
}
export -f doit
# These variables are defined outside the function and must be exported to be visible
export BED_DIR
export TMP_DIR
export OUTPUT_DIR
# If there are many of these variables, use env_parallel instead of
# parallel. Then you do not need to export the variables.
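A minimal sketch of the env_parallel variant, assuming bash (env_parallel is activated once via the helper script it ships with, and then copies functions and variables into the jobs automatically):

# Activate env_parallel for bash (often done once in ~/.bashrc).
. $(which env_parallel.bash)
# No export / export -f needed: doit, BED_DIR, TMP_DIR and OUTPUT_DIR
# travel along with each job.
env_parallel --results outputdir/ doit {1} {2} {#} ::: * :::: path/to/list_with_products.txt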
If list_with_products.txt is the same for every sample:
parallel --results outputdir/ doit {1} {2} {#} ::: * :::: path/to/list_with_products.txt
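Since the original goal was to cap the server load, a -j flag can be added here as well; without it GNU parallel runs one job per CPU core by default. For example:

# Run at most 50 jobs at a time. {#} is the job sequence number,
# which doit uses to name its temporary file uniquely.
parallel -j 50 --results outputdir/ doit {1} {2} {#} ::: * :::: path/to/list_with_products.txt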
If list_with_products.txt differs for each sample:
# Generate a list of:
# sample \t product
parallel --tag cd {}\;cat list_with_products.txt ::: * |
# call doit on each sample,product. Put output in outputdir
parallel --results outputdir/ --colsep '\t' doit {1} {2} {#}
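Here --tag prefixes every line emitted by the first parallel with its argument ({}), so the intermediate stream holds one tab-separated sample/product pair per line, which --colsep '\t' then splits into {1} and {2} for doit. With hypothetical sample and product names the stream would look like:

sample_A	product_1@foo
sample_A	product_2@bar
sample_B	product_1@foo

A -j 50 can be given to the second parallel here as well to cap the load.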