Splitting multiple input files into multiple outputs using split function in linux

Question

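I have 8 files and I want to split each of them into 5 chunks. I would normally do this one file at a time, but I would like to run it as a loop. I am working on an HPC cluster.

I created a list of the file names and saved it as "variantlist.txt". My code is: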

for f in `cat variantlist.txt`; do split ${f} -n 5 -d; done

However, it only splits the final file listed in variantlist.txt, so I get just 5 chunks, all from the last entry.

Even when I list the files individually:

for f in chr001.vcf chr002 ...chr008.vcf ; do split ${f} -n 5 -d; done

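it still only splits the final file into 5 chunks.

I am not sure where I am going wrong. The desired output is 40 chunks, 5 per chromosome. Any help would be much appreciated.

Many thanks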

Answer 1

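When using split, the -n switch determines the number of output files that the original file is split into...

You need -l for the number of lines you want, which in your case is 5: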

 split -l 5 ${f}
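
To make the difference between the two switches concrete, here is a minimal sketch assuming GNU split. The file demo.txt and the prefixes byn., byl. and byln. are made up for illustration. Note that -n 5 cuts by byte count and may break a line in the middle, while -n l/5 also gives 5 files but keeps lines whole:

```shell
# Sketch only: demo.txt and the byn./byl./byln. prefixes are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 20 > demo.txt                  # 20 lines of sample data

split -n 5   -d demo.txt byn.        # 5 chunks of roughly equal bytes (may cut mid-line)
split -l 5   -d demo.txt byl.        # 5 lines per chunk, so 4 chunks for 20 lines
split -n l/5 -d demo.txt byln.       # 5 chunks, never splitting inside a line

n_chunks=$(ls byn.* | wc -l)
l_chunks=$(ls byl.* | wc -l)
nl_chunks=$(ls byln.* | wc -l)
echo "-n 5: $n_chunks files, -l 5: $l_chunks files, -n l/5: $nl_chunks files"
```

For line-oriented data such as VCF, -n l/5 is probably closer to "5 chunks per file" than either -n 5 or -l 5.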

Answer 2

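split creates the same set of output files on every run (x00 through x04 here), so each run overwrites the files from the previous one. Here is one way to handle that: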

for f in $(<variantlist.txt)  # don't use cat
do  mkdir -p $f.split         # make a subdir for the files
    ( cd $f.split &&          # change into the subdir only in a subshell
      split ../$f -n 5 -d     # split from there
    )                         # close the subshell, parent still in base dir
done
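
The overwriting itself is easy to reproduce. A small sketch, assuming GNU split and two hypothetical input files a.txt and b.txt:

```shell
# Sketch only: a.txt and b.txt are hypothetical stand-ins for the real inputs.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'AAAAAAAAAA\n' > a.txt
printf 'BBBBBBBBBB\n' > b.txt

for f in a.txt b.txt; do
    split "$f" -n 5 -d               # both runs write to x00 .. x04
done

count=$(ls x?? | wc -l)              # 5 files exist, not 10
content=$(cat x??)                   # only data from b.txt survives
echo "$count chunks, contents: $content"
```

Only the chunks from the last file remain, which matches the behaviour described in the question.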

Or you could do this:

while read f             # grab each filename
do split $f -n 5 -d      # split it
   for x in x??          # for each split file
   do mv $x $f.$x        # rename it to include the parent file name
   done
done < variantlist.txt   # take names from this file

This is a lot slower, but it does not use subdirectories.

My favorite, though, is:

xargs -I {} split {} -n 5 -d {} < variantlist.txt

The last argument becomes the prefix that split uses for its output files, in place of the default x.

EDIT: with 2 billion lines in each file, use this:
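
As a quick check of the resulting names, here is a sketch assuming GNU split and xargs. The inputs chrA.vcf and chrB.vcf are hypothetical stand-ins, not the real data:

```shell
# Sketch only: chrA.vcf and chrB.vcf are made-up sample inputs.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 50 > chrA.vcf
seq 1 50 > chrB.vcf
printf '%s\n' chrA.vcf chrB.vcf > variantlist.txt

# {} is replaced by each file name, once as the input and once as the prefix
xargs -I {} split {} -n 5 -d {} < variantlist.txt

n_out=$(ls chrA.vcf0? chrB.vcf0? | wc -l)
echo "$n_out chunks: chrA.vcf00 .. chrA.vcf04 and chrB.vcf00 .. chrB.vcf04"
```

Because each input gets its own prefix, nothing is overwritten and no renaming or subdirectory is needed.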

for f in $(<variantlist.txt)
do split "$f" -d -n 5 "$f" & # run all in background at the same time
done
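
If this runs inside a batch job script, it may be worth adding wait after the loop so the script does not exit while the background splits are still running. A sketch under that assumption, with small hypothetical files standing in for the real ones and a trailing dot added to the prefix purely for readability:

```shell
# Sketch only: the chr00N.vcf files here are tiny made-up samples.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 1 2 3; do seq 1 1000 > "chr00$i.vcf"; done
ls chr00?.vcf > variantlist.txt

while IFS= read -r f; do
    split "$f" -d -n 5 "$f." &       # one background split per input file
done < variantlist.txt
wait                                 # block until every background split has finished

n_out=$(ls chr00?.vcf.0? | wc -l)
echo "$n_out chunks written"
```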

bash

split

loops

for-loop

vcf-variant-call-format