Merging files in a folder with the same file name except one character

Question

I have file names like the following:

fastqs/hgmm_100_S1_L001_R1_001.fastq.gz
fastqs/hgmm_100_S1_L002_R1_001.fastq.gz
fastqs/hgmm_100_S1_L003_R1_001.fastq.gz

fastqs/hgmm_100_S1_L001_R2_001.fastq.gz
fastqs/hgmm_100_S1_L002_R2_001.fastq.gz
fastqs/hgmm_100_S1_L003_R2_001.fastq.gz

I want to merge them into the groups shown above, so that the files that differ only in LXXX end up together.

I can do this:

cat fastqs/hgmm_100_S1_L00?_R1_001.fastq.gz > data/hgmm_100_S1_R1_001.fastq.gz
cat fastqs/hgmm_100_S1_L00?_R2_001.fastq.gz > data/hgmm_100_S1_R2_001.fastq.gz

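But this requires me to hard-code every file group. How can I set this up so that all L values are merged into one group, producing an output file with the same name as the input files, just without the L part?

Thanks, Jack

Edit:

Sorry for not including this in the original post, but what if I have something like this: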

fastqs/hgmm_100_S1_L001_R1_001.fastq.gz
fastqs/hgmm_100_S1_L002_R1_001.fastq.gz
fastqs/hgmm_100_S1_L003_R1_001.fastq.gz

fastqs/hgmm_200_S1_L001_R2_001.fastq.gz
fastqs/hgmm_200_S1_L002_R2_001.fastq.gz
fastqs/hgmm_200_S1_L003_R2_001.fastq.gz

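(The only change is at the very beginning (100 -> 200).)

How would this work? Basically, I want to merge these files whenever every part of the name except L??? is the same.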

Answer 1

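You can do the grouping on the fly. Loop over all the files and append each one to its group file. Patterns using * and ? expand in sorted order, so the lanes should be concatenated in the right order.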

cd fastqs
for f in *_L???_*fastq.gz; do
    # Drop the _L???_ field to get the group file name, then append to it.
    cat "$f" >> "../data/${f/_L???_/_}"
done
cd ..

Because the files are always appended, you should clear the data/ directory before running this command again.
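If clearing data/ by hand is easy to forget, the same loop can be wrapped so that it removes its own stale outputs first. This is only a sketch: the function name merge_lanes and its two directory arguments are made up for illustration, and it assumes bash (for the ${name/pattern/replacement} substitution) and lane fields that are always L plus three characters.

```shell
#!/usr/bin/env bash

# merge_lanes SRC_DIR DST_DIR   (hypothetical helper, not part of the answer above)
# Concatenates SRC_DIR/*_L???_*fastq.gz into DST_DIR/<same name without _L???_>.
merge_lanes() {
    local src=$1 dst=$2 f name
    mkdir -p "$dst" || return 1

    # Pass 1: delete only the outputs this run is going to rebuild,
    # so running the function twice does not double the data.
    for f in "$src"/*_L???_*fastq.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        name=${f##*/}                    # strip the directory part
        rm -f "$dst/${name/_L???_/_}"
    done

    # Pass 2: append each lane file to its group file, in glob (sorted) order.
    for f in "$src"/*_L???_*fastq.gz; do
        [ -e "$f" ] || continue
        name=${f##*/}
        cat "$f" >> "$dst/${name/_L???_/_}"
    done
}
```

Calling it as merge_lanes fastqs data should give the same result as the loop above, without the cd and without needing an empty data/ directory beforehand.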

Answer 2

If the pattern _L###_ only occurs in that one part of the file name, you could try something like this:

#!/usr/bin/env bash

# Define an associative array. Requires bash 4+
declare -A a

# Enable extended glob notation such as +([0-9]); see "Pattern Matching" in the bash man page.
shopt -s extglob

# Collect the file patterns by writing indexes in the array.
for f in fastqs/*_L+([0-9])_*.fastq.gz; do
  a["${f/_L+([0-9])_/_*_}"]=1
done

# And finally, gather your files.
for f in "${!a[@]}"; do
  # Strip any existing directory part of the filename to build our target
  target="data/${f##*/}"
  # Concatenate files matching the glob into our intended target
  cat $f > "${target/[*]_/}"
done

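We use pattern substitution to turn the variable part of each filespec into a glob.
We use the indices (keys) of an associative array because that is an easy way to keep a unique list.
${!a[@]} lets us iterate over the array's indices rather than its values.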
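To see the unique-key idea on its own, here is a small sketch that only computes the group names and does not touch any files. The function name lane_groups is hypothetical, and it assumes bash 4+ (associative arrays) with extglob enabled. The sample names are the ones from the edit in the question, where the 100 and 200 files must stay in separate groups.

```shell
#!/usr/bin/env bash
# Assumes bash 4+ for associative arrays.
shopt -s extglob   # needed for the +([0-9]) pattern below

# lane_groups NAME...   (hypothetical helper)
# Prints each distinct group name once, sorted, with the _L<digits>_ field removed.
lane_groups() {
    local -A seen=()
    local f
    for f in "$@"; do
        # Names that differ only in the lane field collapse onto the same key.
        seen["${f/_L+([0-9])_/_}"]=1
    done
    [ "${#seen[@]}" -gt 0 ] || return 0
    # ${!seen[@]} expands to the keys; sort because key order is unspecified.
    printf '%s\n' "${!seen[@]}" | sort
}

lane_groups \
    hgmm_100_S1_L001_R1_001.fastq.gz \
    hgmm_100_S1_L002_R1_001.fastq.gz \
    hgmm_100_S1_L003_R1_001.fastq.gz \
    hgmm_200_S1_L001_R2_001.fastq.gz \
    hgmm_200_S1_L002_R2_001.fastq.gz \
    hgmm_200_S1_L003_R2_001.fastq.gz
```

With those six names it prints two lines, hgmm_100_S1_R1_001.fastq.gz and hgmm_200_S1_R2_001.fastq.gz, so every part of the name other than the lane field still separates the groups.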

Tags: unix, bash, merge, cat, fastq