Is there a way to replace all occurrences of certain characters, but only on every nth line?

Question

I am trying to replace every character that is not C, T, A or G with an N in the sequence part of a fasta file, i.e. on every 2nd line.

I think some combination of awk and tr is what I need...

To print every other line:

awk '{if (NR % 2 == 0) print $0}' myfile

To replace those characters with an N:

tr YRHIQ- N

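...but I don't know how to combine them so that the character replacement happens only on every second line while every line is still printed.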

This is the kind of thing I have:

>SEQUENCE_1
AGCYGTQA-TGCTG
>SEQUENCE_2
AGGYGTQA-TGCTC

I want it to look like this:

>SEQUENCE_1
AGCNGTNANTGCTG
>SEQUENCE_2
AGGNGTNANTGCTC

But not like this:

>SENUENCE_1
AGCNGTNANTGCTG
>SENUENCE_2
AGGNGTNANTGCTC

Answer 1

Thanks to @kvantour for the explanation of fasta files. Here is another sed solution, which fits your task better than the old one:

sed '/^>/! s/[^ACTG]/N/g' file.fasta

/^>/!: if the line does not start with >,
s/[^ACTG]/N/g: replace every character other than A, C, T and G with N.
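To see the negated address at work, here is a minimal sketch that writes the two sample records from the question to a temporary file and runs the same sed command on it. The variable names `sample` and `sed_out` are only illustrative assumptions. Header lines should pass through untouched, while Y, Q and - in the sequence lines become N.

```shell
# Build a small test file from the sample in the question.
# mktemp chooses the path; the variable names are illustrative only.
sample=$(mktemp)
printf '%s\n' '>SEQUENCE_1' 'AGCYGTQA-TGCTG' '>SEQUENCE_2' 'AGGYGTQA-TGCTC' > "$sample"

# /^>/!          apply the command only to lines that do NOT start with ">"
# s/[^ACTG]/N/g  replace every character outside the set A, C, T, G with N
sed_out=$(sed '/^>/! s/[^ACTG]/N/g' "$sample")
printf '%s\n' "$sed_out"

# Clean up the temporary file.
rm -f "$sample"
```

Because the address tests the content of the line rather than its position, this does not depend on headers and sequences strictly alternating.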

Answer 2

Here is one solution with awk:

awk 'NR%2 ==0{gsub(/[^CTAG]/, "N")}1' file

Result

>SEQUENCE_1
AGCNGTNANTGCTG
>SEQUENCE_2
AGGNGTNANTGCTC

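Explanation: as the OP wanted, I only select every second line to apply the change, using NR%2 == 0.

NR is the number of records (here, lines) read from file so far,

and gsub(/[^CTAG]/, "N") replaces every character that is not 'C', 'T', 'A' or 'G' with N.

In [^CTAG], the ^ is negation.

An awk program is written in the form expression { action }.

Here the expression is NR%2==0 and the action is the gsub call, which replaces the characters that are not C, T, A or G with N. The trailing 1 is an always-true expression with no action, so awk applies its default action and prints every line.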
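As a rough illustration of how the NR test selects lines, the sketch below first prints NR and NR % 2 next to each input line, then runs the answer's command on the same input. The variable names `parity` and `awk_out` are assumptions made for this example, and the input is the sample from the question.

```shell
# Show the record number (NR), its remainder modulo 2, and the line itself.
parity=$(printf '%s\n' '>SEQUENCE_1' 'AGCYGTQA-TGCTG' '>SEQUENCE_2' 'AGGYGTQA-TGCTC' |
  awk '{ print NR, NR % 2, $0 }')
printf '%s\n' "$parity"

# Same input through the answer's command: NR%2==0 is true on lines 2 and 4,
# gsub edits those lines in place, and the trailing 1 prints every line.
awk_out=$(printf '%s\n' '>SEQUENCE_1' 'AGCYGTQA-TGCTG' '>SEQUENCE_2' 'AGGYGTQA-TGCTC' |
  awk 'NR%2 ==0{gsub(/[^CTAG]/, "N")}1')
printf '%s\n' "$awk_out"
```

The first listing shows a remainder of 0 on the sequence lines only, which is why the headers are left alone in the second.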

Answer 3

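Your question is easy to answer, but the answer will not help you when you handle generic fasta files. A fasta file has a sequence header followed by one or more lines that can be concatenated to represent the sequence. The fasta file format roughly follows these rules: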

The description line (defline) or header/identifier line, which begins with <greater-than> character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.

Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...).

The sequence can span multiple lines.

A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.

To answer the OP's question: if you only want to process every second line, you would do:

awk '!(NR%2){gsub(/[^CTAG]/, "N")}1' file.fasta

However, this approach fails in either of the following cases:

fasta files with multi-line sequences
multi-fasta files, which may have a blank line between subsequent sequences
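The first failure mode can be reproduced with a small made-up input in which the first sequence is wrapped over two lines. This is only a sketch: the record contents are the question's sample split by hand, and the variable name `broken` is an assumption.

```shell
# The first record's sequence spans two lines, so headers and sequence
# lines no longer alternate strictly.
broken=$(printf '%s\n' '>SEQUENCE_1' 'AGCYGTQA' '-TGCTG' '>SEQUENCE_2' 'AGGYGTQA-TGCTC' |
  awk '!(NR%2){gsub(/[^CTAG]/, "N")}1')
printf '%s\n' "$broken"
```

In this run, line 3 (the second half of the first sequence) is skipped and keeps its "-", while line 4, which is the second header, is rewritten almost entirely to N. The last sequence line is odd-numbered and is also left unchanged.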

A better approach is to exclude the header lines and process all other lines:

awk '!/^>/{gsub(/[^CTAG]/, "N")}1' file.fasta
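As a quick check of the header-based version, the sketch below feeds it a hand-made input that has both a wrapped sequence and a blank line between records. The input and the variable name `fixed` are assumptions for illustration, not real data.

```shell
# Multi-line first sequence plus a blank line between the two records.
fixed=$(printf '%s\n' '>SEQUENCE_1' 'AGCYGTQA' '-TGCTG' '' '>SEQUENCE_2' 'AGGYGTQA-TGCTC' |
  awk '!/^>/{gsub(/[^CTAG]/, "N")}1')
printf '%s\n' "$fixed"
```

Both headers come through unchanged, every sequence line is cleaned regardless of its line number, and the blank line is preserved because gsub finds nothing to replace on an empty line.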

