How to extract some text from a column to create a new column

Question

Dear Stack Overflow community,

I have a two-column file that looks like this:

Ccrux.00013.c0_g1_i1    .
Ccrux.00013.c0_g2_i1    .
Ccrux.00014.c0_g1_i1    .
Ccrux.00014.c0_g2_i1    .
Ccrux.00015.c0_g1_i1    .
Ccrux.00015.c0_g1_i1    GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0005506^molecular_function^iron ion binding`GO:0031418^molecular_function^L-ascorbic acid binding`GO:0016706^molecular_function^oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors`GO:0045646^biological_process^regulation of erythrocyte differentiation
Ccrux.00015.c0_g2_i1    GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0005506^molecular_function^iron ion binding`GO:0031418^molecular_function^L-ascorbic acid binding`GO:0016706^molecular_function^oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors`GO:0045646^biological_process^regulation of erythrocyte differentiation
Ccrux.00016.c0_g1_i1    .
Ccrux.00016.c0_g2_i1    .
Ccrux.00017.c0_g1_i1    .
Ccrux.00018.c0_g1_i1    .
Ccrux.00019.c0_g1_i1    .

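I need a new two-column file that:

Does not contain the rows whose column 2 value is "."
Contains only GO:XXXXXXX as the second-column value (i.e. strips all the text from the second column and keeps only the GO numbers)

The new file should look like this: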

Ccrux.00015.c0_g1_i1    GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646
Ccrux.00015.c0_g2_i1    GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646
Ccrux.00029.c0_g1_i1    GO:0035869,GO:0005737,GO:0005615,GO:0016020,GO:0021956,GO:0060271,GO:0021904,GO:0001701,GO:0001841,GO:0008589,GO:0021523,GO:0021537

I have been trying with perl:

perl -ne '/(GO:\d+)/ && print ""' input.file > output.file

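But it only printed all the GO numbers in a single column. I really don't know how to do this. Any suggestions would be very welcome.

Thanks in advance, everyone.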

Answer 1

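Your pattern matches one chunk of text and then prints it.

It sounds like what you are dealing with is:

GO:0005789^cellular_component^endoplasmic reticulum membrane`

and you want to remove everything between the ^ and the next GO?

The nice thing about perl is that -ne simply wraps a small while loop around your command, so it lets you run multiple statements.

So, expanding the example: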

#!/usr/bin/env perl 
use strict;
use warnings;

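while (<DATA>) {
    # skip lines that carry no GO annotation
    next unless m/GO/;
    # "^description`" between two GO ids becomes a comma
    s/\^[^`]+`/,/g;
    # the final "^description" up to the end of the line becomes a newline
    s/\^[^`]+$/\n/g;
    print;
}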

__DATA__
Ccrux.00013.c0_g1_i1    .
Ccrux.00013.c0_g2_i1    .
Ccrux.00014.c0_g1_i1    .
Ccrux.00014.c0_g2_i1    .
Ccrux.00015.c0_g1_i1    .
Ccrux.00015.c0_g1_i1    GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0005506^molecular_function^iron ion binding`GO:0031418^molecular_function^L-ascorbic acid binding`GO:0016706^molecular_function^oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors`GO:0045646^biological_process^regulation of erythrocyte differentiation
Ccrux.00015.c0_g2_i1    GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0005506^molecular_function^iron ion binding`GO:0031418^molecular_function^L-ascorbic acid binding`GO:0016706^molecular_function^oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors`GO:0045646^biological_process^regulation of erythrocyte differentiation
Ccrux.00016.c0_g1_i1    .
Ccrux.00016.c0_g2_i1    .
Ccrux.00017.c0_g1_i1    .
Ccrux.00018.c0_g1_i1    .
Ccrux.00019.c0_g1_i1    .

This produces the output:

Ccrux.00015.c0_g1_i1    GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646
Ccrux.00015.c0_g2_i1    GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646

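What we do:

Skip any line that does not contain the string GO.
Replace every instance of a literal ^, followed by one or more characters that are not a backtick, followed by a backtick, with a comma.
Replace the same pattern when it runs to the end of the line (rather than ending in a backtick) with \n.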

We can condense this into a one-liner:

perl -ne 'next unless m/GO/;s/\^[^`]+`/,/g;s/\^[^`]+$/\n/g;print' inputfile > outputfile

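Or better still, drop the explicit print: see perlrun, where -p is like -n but with the print built in (so it behaves more like sed). One caveat: under -p the print still runs after a next, so lines without GO are emptied rather than skipped:

perl -pe '$_ = "" unless m/GO/;s/\^[^`]+`/,/g;s/\^[^`]+$/\n/g;' inputfile > outputfile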

Answer 2

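I think your requirement is a bit too long for a one-line solution, but it can still be very short. This program produces the output you describe. It expects the path to the input file as an argument on the command line.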

use strict;
use warnings;

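while ( <> ) {
    # collect every GO id on the line; skip the line if there are none
    next unless my @values = /GO:\d+/g;
    # join interpolated array elements with commas instead of spaces
    local $" = ',';
    # keep the first column and its whitespace, replace the rest of the line
    s/\S\s+\K.+/@values/;
    print;
}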

A one-liner version is a little clumsy:

perl -ne '@v=/GO:\d+/g or next; $"=","; s/\S\s+\K.+/@v/; print;' myfile > newfile

perl

text-extraction

selection