Better way to manipulate data using AWK and SED

I was wondering if someone could help me re-write this in a smarter, more sensible way?

sed -e '1d; $d' <someinputfile> |
awk -F"\t" '{split(,a,/-/); print ","a[1]","a[2]","","","","}' |
sed -e "s/,/\",\"/g" |
sed 's/^/"/;s/$/"/' |
sed -e $'1i\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'

Writing the correct output should already be possible with awk alone, so I think there is a better way to write this.

A shorter way? A more efficient way? A more correct way? POSIX compliant? GNU compatible?

If you can help, please also try to explain the changes, because I really want to understand the "how" and the "what is what" :)

Thanks!

What it does:

  1. Delete the first and the last line
  2. Split the second field on its delimiter and print (it should be possible to print the correct format right there; see the sketch after this list)
  3. Change the , from the previous awk print into ","
  4. Wrap every line in "
  5. Add a new header
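
Something along these lines is what I imagine should be possible, with a single awk doing steps 2-5 (just an untested sketch; input.tsv stands in for my real file):

sed -e '1d; $d' input.tsv |
awk -F'\t' -v OFS='","' '
  BEGIN { print "\"field_one\",\"field_two\",\"field_three\",\"field_four\",\"field_five\",\"field_six\",\"field_seven\"" }
  {
    split($2, a, "-")                                                  # split field two on "-"
    print "\"" $1 OFS a[1] OFS a[2] OFS $3 OFS $4 OFS $5 OFS $6 "\""   # join with "," and wrap the line in quotes
  }
'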

If anyone wants to play with the input file, here is an example:

START   9       1997-07-27T13:37:01Z
X1      24087-27        Axgma8PYjc1yRJlUr41688  1997-07-27T13:09:00Z    9876    OK
X1      642-68  6nwPtLQTqAAKufH3ejoEeg  1997-07-27T14:31:00Z    9876    OK
X1      642-31  qfKH99UnxZTcp2AN8NNB21  1997-07-27T16:15:00Z    9876    OK
X1      642-24  PouJBByqUJkqhKHBynUesD  1997-07-27T16:15:00Z    9876    OK
X1      642-30  J7t2sJKKtcxWJr18I84A46  1997-07-27T16:15:00Z    9876    OK
X1      642-29  g7hPkNpUywvk6FvGqgpHsx  1997-07-27T16:15:00Z    9876    OK
X1      642-26  W2KM24xvmy0Q8cLV950tXq  1997-07-27T16:15:00Z    9876    OK
X1      642-25  dqu8jB5tUthIKevNAQXgld  1997-07-27T16:15:00Z    9876    OK
X1      753-32  Gh0kZkIJr8j6FSYljbpyyy  1997-07-27T16:15:00Z    9876    OK
X1      753-23  Jvl8LMh6SDHfgvLfJIHi5l  1997-07-27T16:15:00Z    9876    OK
X1      753-28  IZ83996cthjhZGYcAk97iJ  1997-07-27T16:15:00Z    9876    OK
X1      753-22  YJwokU0Dq6xiydkf3EDyxl  1997-07-27T16:15:00Z    9876    OK
X1      753-36  OZHOMirRKjA3LcXTbPJL31  1997-07-27T16:15:00Z    9876    OK
X1      753-34  LvMgT6ed1b1e3uwasGi48G  1997-07-27T16:15:00Z    9877    OK
X1      753-35  VJk4x8sTG1BJTnZYvgu6px  1997-07-27T16:15:00Z    9876    OK
X1      663-27  mkZXgTHKBjmAplrDeoQZXo  1997-07-27T16:15:00Z    9875    ERR
X1              f1K1PzQ9sp2QAv1AX0Zix4  1997-07-27T16:27:00Z    9875    ERR
DONE     69      3QXFXKQAFRSZXJLJ6JZ9NWMXR00B1V1J1FUMBQAA9DQSRCTZF8JXAWWSGHSDIPQ9

Thanks!

PS: Since I'm not sure whether you will get the same output on your machine, here is how it looks when I run it, which is correct and exactly how I want it:

"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"

One awk idea:

awk '
BEGIN { FS="\t"
        OFS="\",\""                 # define output field delimiter as <doublequote> <comma> <doublequote>

        # print header
        print "\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven\""
      }

FNR>1 { if (prev) print prev
        split($2,a,"-")             # split field 2 on the dash

        # reformat current line and save in variable "prev", to be printed on next pass; add <doublequote> on ends
        prev= "\""  OFS a[1] OFS a[2] OFS  OFS  OFS  OFS  "\""
      }
' input.dat

This generates:

"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"

Given:

sed -E 's/\t/\\t/g' file
START\t9\t1997-07-27T13:37:01Z
X1\t24087-27\tAxgma8PYjc1yRJlUr41688\t1997-07-27T13:09:00Z\t9876\tOK
X1\t642-68\t6nwPtLQTqAAKufH3ejoEeg\t1997-07-27T14:31:00Z\t9876\tOK
X1\t642-31\tqfKH99UnxZTcp2AN8NNB21\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-24\tPouJBByqUJkqhKHBynUesD\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-30\tJ7t2sJKKtcxWJr18I84A46\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-29\tg7hPkNpUywvk6FvGqgpHsx\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-26\tW2KM24xvmy0Q8cLV950tXq\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-25\tdqu8jB5tUthIKevNAQXgld\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-32\tGh0kZkIJr8j6FSYljbpyyy\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-23\tJvl8LMh6SDHfgvLfJIHi5l\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-28\tIZ83996cthjhZGYcAk97iJ\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-22\tYJwokU0Dq6xiydkf3EDyxl\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-36\tOZHOMirRKjA3LcXTbPJL31\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-34\tLvMgT6ed1b1e3uwasGi48G\t1997-07-27T16:15:00Z\t9877\tOK
X1\t753-35\tVJk4x8sTG1BJTnZYvgu6px\t1997-07-27T16:15:00Z\t9876\tOK
X1\t663-27\tmkZXgTHKBjmAplrDeoQZXo\t1997-07-27T16:15:00Z\t9875\tERR
X1\t\tf1K1PzQ9sp2QAv1AX0Zix4\t1997-07-27T16:27:00Z\t9875\tERR
DONE\t69\t3QXFXKQAFRSZXJLJ6JZ9NWMXR00B1V1J1FUMBQAA9DQSRCTZF8JXAWWSGHSDIPQ9

Using a proper CSV parser for this kind of problem is a good idea.

Ruby is everywhere, and its distribution includes a very lightweight but powerful CSV parser.

Here is a Ruby version:

ruby -r csv -e '
data=CSV.parse($<.read, **{:col_sep=>"\t"})
d2=CSV::Table.new([], headers:["field_one","field_two","field_three","field_four","field_five","field_six","field_seven"])
data[1...-1].each { |r| 
    r_=[]
    r.each_with_index { |e,i|
        if i == 1 
            e && e[/-/] ? (r_.concat e.split(/-/,2)) : (r_.concat ["",""])
        else
            r_ << e
        end
    }
    d2 << r_ }
puts d2.to_csv(**{:force_quotes=>true})
' file

Prints:

"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"

I would modify this part of the code

awk -F"\t" '{split(,a,/-/); print ","a[1]","a[2]","","","","}' | sed -e "s/,/\",\"/g" | sed 's/^/"/;s/$/"/' | sed -e $'1i\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'

in the following way. First step: print "," right away instead of printing , and changing it afterwards, i.e.

awk -F"\t" '{split(,a,/-/); print "\",\""a[1]"\",\""a[2]"\",\"""\",\"""\",\"""\",\""}' | sed 's/^/"/;s/$/"/' | sed -e $'1i\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'

Second step: add the leading " and trailing " inside the print

awk -F"\t" '{split(,a,/-/); print "\"""\",\""a[1]"\",\""a[2]"\",\"""\",\"""\",\"""\",\"""\""}' | sed -e $'1i\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'

Third step: print the header from a BEGIN block, i.e.

awk -F"\t" 'BEGIN{print "\"field_one\",\"field_two\",\"field_three\",\"field_four\",\"field_five\",\"field_six\",\"field_seven\""}{split(,a,/-/); print "\"""\",\""a[1]"\",\""a[2]"\",\"""\",\"""\",\"""\",\"""\""}'

(tested with gawk 4.2.1)
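
Putting the three steps together, a sketch of the final shape (the leading sed from the original pipeline is still needed to drop the START and DONE lines; someinputfile is the placeholder from the question):

sed -e '1d; $d' someinputfile |
awk -F"\t" 'BEGIN{print "\"field_one\",\"field_two\",\"field_three\",\"field_four\",\"field_five\",\"field_six\",\"field_seven\""}
            {split($2,a,/-/); print "\""$1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6"\""}'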

Not as elegant a solution as I was hoping for, but it gets the job done -

Instead of hard-coding the verbal names of the fields, it simply computes the required header row dynamically from the actual input, which also accounts for the expected split of field 2.

 gnice gcat sample.txt \
                        \
 | mawk2 'function trim(_) { return \
                         \
       substr("",gsub("^[,]*|[,]*$","",_))_

  } BEGIN {  FS =   "[ ]+"
            OFS = "\",\""
  } NR==2 {
            for(_=NF+!_;_;_--) {
               ___=(_)(OFS)___
            }
            printf("%*s\n",gsub("[0-9]+[^0-9]+",\
                   "field_&",___)~"",trim(___))

  } !/^(START|DONE)/ {

     printf("%.0s%s\n",$2=$(($0=\
             $(sub("[-]"," ",$2)<""))~""),$0) } ' | lgp3 3

"field_1","field_2","field_3","field_4","field_5","field_6","field_7"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"

"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"

"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"

"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"

"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"

"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"