Converting data in rows to columns

Question

My tab-delimited input file looks like this:

13435    830169  830264  a    95   y    16
09433    835620  835672  x    46
30945    838405  838620  a    21   c    19
94853    850475  850660  y    15
04958    865700  865978  c    16   a    98

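After the first three columns, the file lists variable names and their values in the following columns. I need to change the data structure so that the first three columns are followed by one column per variable, like this: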

                         a    x    y    c   
13435    830169  830264  95        16
09433    835620  835672       46
30945    838405  838620  21             19
94853    850475  850660            15
04958    865700  865978  98             16

Is there any code that can do this on Linux? The file is 7.6 MB and has about 450,000 lines in total. There are four variables in total.

Thanks

Answer 1

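If you know you have the 4 variables a, x, y, c, the file is tab-delimited, and you want the exact format shown in your output, you can simply use a brute-force approach: check fields 4 and 6 for the variable name and output the value from field 5 or 7, formatted with printf.

For example, knowing the variable names, you can simply output the header row and then process each record as follows: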

awk -F"\t" '
  FNR==1 {
    print "\t\t\t  a    x    y    c"
  }
  {
    # the three leading columns, padded to fixed widths
    printf "%-8s%8s%8s  ", $1, $2, $3

    # for each variable: the name is in field 4 or 6, the value in field 5 or 7
    if ($4=="a")
      printf "%-5s", $5
    else if ($6=="a")
      printf "%-5s", $7
    else
      printf "%-5s", " "

    if ($4=="x")
      printf "%-5s", $5
    else if ($6=="x")
      printf "%-5s", $7
    else
      printf "%-5s", " "

    if ($4=="y")
      printf "%-5s", $5
    else if ($6=="y")
      printf "%-5s", $7
    else
      printf "%-5s", " "

    if ($4=="c")
      printf "%-5s\n", $5
    else if ($6=="c")
      printf "%-5s\n", $7
    else
      print ""
  }
' tabfile

Example Use/Output

With your input in tabfile, you would have:

$ awk -F"\t" '
>   FNR==1 {
>     print "\t\t\t  a    x    y    c"
>   }
>   {
>     printf "%-8s%8s%8s  ", $1, $2, $3
>
>     if ($4=="a")
>       printf "%-5s", $5
>     else if ($6=="a")
>       printf "%-5s", $7
>     else
>       printf "%-5s", " "
>
>     if ($4=="x")
>       printf "%-5s", $5
>     else if ($6=="x")
>       printf "%-5s", $7
>     else
>       printf "%-5s", " "
>
>     if ($4=="y")
>       printf "%-5s", $5
>     else if ($6=="y")
>       printf "%-5s", $7
>     else
>       printf "%-5s", " "
>
>     if ($4=="c")
>       printf "%-5s\n", $5
>     else if ($6=="c")
>       printf "%-5s\n", $7
>     else
>       print ""
>   }
> ' tabfile
                          a    x    y    c
13435     830169  830264  95        16
09433     835620  835672       46
30945     838405  838620  21             19
94853     850475  850660            15
04958     865700  865978  98             16

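This provides the desired output. The one-pass approach will also be very efficient for 450,000 lines of input. Since this is a bit long for a command-line script, you can simply put it in an awk script file and call it with the filename. Let me know if you have any questions.

As a Script File

To use it as a script file, simply put the contents in a file and make it executable, e.g.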

#!/usr/bin/awk -f

BEGIN { FS="\t" }
FNR==1 {
  print "\t\t\t  a    x    y    c"
}
{
  # the three leading columns, padded to fixed widths
  printf "%-8s%8s%8s  ", $1, $2, $3

  # for each variable: the name is in field 4 or 6, the value in field 5 or 7
  if ($4=="a")
    printf "%-5s", $5
  else if ($6=="a")
    printf "%-5s", $7
  else
    printf "%-5s", " "

  if ($4=="x")
    printf "%-5s", $5
  else if ($6=="x")
    printf "%-5s", $7
  else
    printf "%-5s", " "

  if ($4=="y")
    printf "%-5s", $5
  else if ($6=="y")
    printf "%-5s", $7
  else
    printf "%-5s", " "

  if ($4=="c")
    printf "%-5s\n", $5
  else if ($6=="c")
    printf "%-5s\n", $7
  else
    print ""
}

Saved as awkscript, you would chmod +x awkscript and then run:

$ ./awkscript tabfile
                          a    x    y    c
13435     830169  830264  95        16
09433     835620  835672       46
30945     838405  838620  21             19
94853     850475  850660            15
04958     865700  865978  98             16

Answer 2

input="\
13435   830169  830264  a   95  y   16
09433   835620  835672  x   46
30945   838405  838620  a   21  c   19
94853   850475  850660  y   15
04958   865700  865978  c   16  a   98
"

With awk:

printf '\t\t\ta\tx\ty\tc\n'
echo -n "$input" |
awk -v vars='a x y c' '
  BEGIN {NV = split(vars,V)}
  {
     s = $1 "\t" $2 "\t" $3;
     delete a;
     for(i = 4; i < NF; i = i+2) a[$i] = $(i+1);
     for(i = 1; i <= NV; i++) s = s "\t" a[V[i]];
     print s
  }
'

With ruby:

printf '\t\t\ta\tx\ty\tc\n'
echo -n "$input" |
vars='a x y c' ruby -ane '
    BEGIN{v = ENV["vars"].split};
    h = Hash[*$F[3..-1]];
    puts $F[0..2].concat(v.map{|v| h[v]}).join("\t")
'

Output:

            a   x   y   c
13435   830169  830264  95      16  
09433   835620  835672      46      
30945   838405  838620  21          19
94853   850475  850660          15  
04958   865700  865978  98          16

Answer 3

In perl:

$ perl -lane '
    BEGIN { print join("\t", "", "", "", "a", "x", "y", "c"); }
    my %vars = @F[3..$#F];
    print join("\t", @F[0..2], @vars{qw/a x y c/});
  ' input.tsv
                        a       x       y       c
13435   830169  830264  95              16
09433   835620  835672          46
30945   838405  838620  21                      19
94853   850475  850660                  15
04958   865700  865978  98                      16

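The fourth column and everything after it are treated as the key/value pairs of a hash table. The values of the variables present in it are then extracted in the correct order, along with the first three columns. Lots of use of slices.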
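For readers who do not use Perl, the same pairs-to-hash idea can be sketched in Python. This is only an illustration, not part of the answer: the helper name `pivot_line` and the hard-coded `VARS` order are assumptions, and it assumes tab-separated fields with complete name/value pairs.

```python
VARS = ["a", "x", "y", "c"]  # assumed column order, taken from the question


def pivot_line(line, variables=VARS):
    """Turn 'k1 k2 k3 name value [name value ...]' into fixed columns."""
    fields = line.rstrip("\n").split("\t")
    # Fields 4 onward alternate name, value: pair them up into a dict.
    pairs = dict(zip(fields[3::2], fields[4::2]))
    # Pull the values out in the fixed order; absent variables become "".
    return "\t".join(fields[:3] + [pairs.get(v, "") for v in variables])


if __name__ == "__main__":
    print("\t".join(["", "", ""] + VARS))  # header: three empty leading cells
    print(pivot_line("13435\t830169\t830264\ta\t95\ty\t16"))
```

The two slices `fields[3::2]` and `fields[4::2]` play the role of the Perl list-to-hash assignment, and the list comprehension stands in for the hash slice.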

Answer 4

Pure bash (requires bash 4.0 or newer):

#!/bin/bash

declare -A var

printf '\t\t\ta\tx\ty\tc\n'
while IFS=$'\t' read -ra fld; do
    var[a]=""  var[x]=""  var[y]=""  var[c]=""
    for ((i = 3; i < ${#fld[@]}; i += 2)); do
        var["${fld[i]}"]=${fld[i + 1]}
    done
    printf '%s\t' "${fld[@]:0:3}"
    printf '%s\t%s\t%s\t%s\n' "${var[a]}" "${var[x]}" "${var[y]}" "${var[c]}"
done < file

Answer 5

Here is an awk that does this:

awk '
BEGIN{fmt="%s\t%s\t%s\t%s\t%s\t%s\t%s\n"}
# first pass: record each variable name in the order it is first seen
NR==FNR{if ($4 && !($4 in seen)) {
            seen[$4]=++col; cols[col]=$4
        }
        if ($6 && !($6 in seen)) {
            seen[$6]=++col; cols[col]=$6
        }
    next
}
# second pass: header (three empty leading cells), then one row per line
FNR==1{printf fmt, "","","",cols[1],cols[2],cols[3],cols[4]}
{   split("",fields)
    fields[seen[$4]]=$5; fields[seen[$6]]=$7
    printf fmt, $1,$2,$3,fields[1],fields[2],fields[3],fields[4]
}
' file file

This finds the variable names and prints them in the order they are first seen.

Prints:

                        a   y   x   c
13435   830169  830264  95  16      
09433   835620  835672          46  
30945   838405  838620  21          19
94853   850475  850660      15      
04958   865700  865978  98          16
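The two-pass, first-seen-order idea can also be sketched in Python. This is a rough illustration under stated assumptions rather than a translation of the answer: the function names `first_seen_vars` and `pivot` are hypothetical, fields are split on any whitespace, and every name/value pair on a line is read, not only fields 4 to 7.

```python
def first_seen_vars(lines):
    """Pass 1: collect variable names in the order they first appear."""
    order = {}  # dicts keep insertion order (Python 3.7+)
    for line in lines:
        fields = line.split()
        for name in fields[3::2]:
            order.setdefault(name, len(order))
    return list(order)


def pivot(lines, variables):
    """Pass 2: build a header plus one tab-separated row per input line."""
    out = ["\t".join(["", "", ""] + variables)]
    for line in lines:
        fields = line.split()
        values = dict(zip(fields[3::2], fields[4::2]))
        out.append("\t".join(fields[:3] + [values.get(v, "") for v in variables]))
    return out


SAMPLE = [
    "13435 830169 830264 a 95 y 16",
    "09433 835620 835672 x 46",
    "30945 838405 838620 a 21 c 19",
    "94853 850475 850660 y 15",
    "04958 865700 865978 c 16 a 98",
]

if __name__ == "__main__":
    for row in pivot(SAMPLE, first_seen_vars(SAMPLE)):
        print(row)
```

Passing a fixed list such as `["a", "x", "y", "c"]` to `pivot` instead of the discovered names gives a caller-chosen column order.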

If you know your variables in advance and want to specify the order of the columns, you can do this:

awk '
BEGIN{
    fmt="%s\t%s\t%s\t%s\t%s\t%s\t%s\n"
    seen["a"]=1;seen["x"]=2;seen["y"]=3;seen["c"]=4
}

# header: three empty leading cells, then the variable names
FNR==1{printf fmt, "","","","a","x","y","c"}
{   split("",fields)
    fields[seen[$4]]=$5; fields[seen[$6]]=$7
    printf fmt, $1,$2,$3,fields[1],fields[2],fields[3],fields[4]
}
' file

Prints:

                        a   x   y   c
13435   830169  830264  95      16  
09433   835620  835672      46      
30945   838405  838620  21          19
94853   850475  850660          15  
04958   865700  865978  98          16

Answer 6

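Assumptions:

the four variable names (a/c/x/y in the sample input) are not known in advance
a variable is always followed by a non-empty value
the number of variable/value pairs (on a single input line) is not known in advance
OP is OK with the variable columns being printed in alphabetical order (OP's desired output does not specify if/how the four variable columns should be ordered)
the order of the rows is maintained (input order == output order)
the host has enough memory to hold the entire input file (via awk arrays); this allows a single pass over the input file; if memory is an issue (i.e., the input file cannot fit in memory) then a different coding/design would be needed (not addressed in this answer)

Another awk idea ... requires GNU awk for multi-dimensional arrays as well as the PROCINFO["sorted_in"] construct: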

awk '
BEGIN { FS=OFS="\t" }                             # input/output field delimiters = <tab>

      { first3[FNR]=$1 OFS $2 OFS $3              # store first 3 fields

        for (i=4;i<=NF;i=i+2) {                   # loop through rest of fields, 2 at a time
            vars[$i]                              # keep track of variable names
            values[FNR][$i]=$(i+1)                # store the value for this line/variable combo
        }
      }

END   { PROCINFO["sorted_in"]="@ind_str_asc"      # sort vars[] indexes in ascending order

        printf "%s%s", OFS, OFS                   # start printing header line ...
        for (v in vars)                           # loop through variable names ...
            printf "%s%s", OFS, v                 # printing to header line
        printf "\n"                               # terminate header line

        for (i=1;i<=FNR;i++) {                    # loop through our set of lines ...
            printf "%s",first3[i]                 # print the 1st 3 fields and then ...
            for (v in vars)                       # loop through list of all variables ...
                printf "%s%s",OFS,values[i][v]    # printing the associated value; non-existent values default to the empty string ""
            printf "\n"                           # terminate the current line of output
        }
      }
' inputfile

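NOTE: this design allows for processing a variable number of variables.

For demonstration purposes we'll use the following tab-delimited input files: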

$ cat input4                                         # OP's sample input file w/ 4 variables
13435   830169  830264  a       95      y       16
09433   835620  835672  x       46
30945   838405  838620  a       21      c       19
94853   850475  850660  y       15
04958   865700  865978  c       16      a       98

$ cat input6                                         # 2 additional variables added to OP's original input file
13435   830169  830264  a       95      y       16
09433   835620  835672  x       46      t       375
30945   838405  838620  a       21      c       19
94853   850475  850660  y       15      j       127     t       453
04958   865700  865978  c       16      a       98

Running these through the awk script generates:

############# input4
                        a       c       x       y
13435   830169  830264  95                      16
09433   835620  835672                  46
30945   838405  838620  21      19
94853   850475  850660                          15
04958   865700  865978  98      16

############# input6
                        a       c       j       t       x       y
13435   830169  830264  95                                      16
09433   835620  835672                          375     46
30945   838405  838620  21      19
94853   850475  850660                  127     453             15
04958   865700  865978  98      16
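
As a cross-check of the same design (read everything once, then print the columns in sorted name order), here is a minimal Python sketch. It is not the answer's code: the function name `pivot_sorted` is hypothetical, it assumes tab-separated input that fits in memory, and it relies on Python's default string sort, which for plain lowercase names like these gives the same order as `@ind_str_asc`.

```python
def pivot_sorted(lines):
    """Read all lines once, then emit columns in sorted variable-name order."""
    rows = []      # (first three fields, {name: value}) for each input line
    names = set()  # every variable name seen anywhere in the input
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        values = dict(zip(fields[3::2], fields[4::2]))
        names.update(values)
        rows.append((fields[:3], values))

    columns = sorted(names)
    out = ["\t".join(["", "", ""] + columns)]  # header row
    for head, values in rows:
        # Variables missing from a line default to the empty string.
        out.append("\t".join(head + [values.get(c, "") for c in columns]))
    return out


if __name__ == "__main__":
    import sys
    # Usage (assumed): python3 pivot_sorted.py < inputfile
    for row in pivot_sorted(sys.stdin):
        print(row)
```

Because the column list is only known after the whole input has been read, the rows have to be buffered, which is the memory trade-off named in the assumptions above.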
