Better way to manipulate data using AWK and SED
I wonder if someone could help me rewrite this in a saner and smarter way?
sed -e '1d; $d' <someinputfile> |
awk -F"\t" '{split($2,a,/-/); print $1","a[1]","a[2]","$3","$4","$5","$6}' |
sed -e "s/,/\",\"/g" |
sed 's/^/"/;s/$/"/' |
sed -e $'1i\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
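This already produces the correct output, but I think there is a better way to write it.
A shorter way? A more efficient way? A more correct way? POSIX compliant? GNU compatible?
If you can help, please also try to explain the changes, because I really want to understand the "how" and the "what is what" :)
Thanks!
What it does:
- Removes the first and last lines
- Splits the second field on a delimiter and prints the fields (it should be possible to print the correct format right here?)
- Changes each , from the preceding awk print to ","
- Adds a " around every line
- Adds a new header
If anyone wants to play with the input file, here is an example: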
START 9 1997-07-27T13:37:01Z
X1 24087-27 Axgma8PYjc1yRJlUr41688 1997-07-27T13:09:00Z 9876 OK
X1 642-68 6nwPtLQTqAAKufH3ejoEeg 1997-07-27T14:31:00Z 9876 OK
X1 642-31 qfKH99UnxZTcp2AN8NNB21 1997-07-27T16:15:00Z 9876 OK
X1 642-24 PouJBByqUJkqhKHBynUesD 1997-07-27T16:15:00Z 9876 OK
X1 642-30 J7t2sJKKtcxWJr18I84A46 1997-07-27T16:15:00Z 9876 OK
X1 642-29 g7hPkNpUywvk6FvGqgpHsx 1997-07-27T16:15:00Z 9876 OK
X1 642-26 W2KM24xvmy0Q8cLV950tXq 1997-07-27T16:15:00Z 9876 OK
X1 642-25 dqu8jB5tUthIKevNAQXgld 1997-07-27T16:15:00Z 9876 OK
X1 753-32 Gh0kZkIJr8j6FSYljbpyyy 1997-07-27T16:15:00Z 9876 OK
X1 753-23 Jvl8LMh6SDHfgvLfJIHi5l 1997-07-27T16:15:00Z 9876 OK
X1 753-28 IZ83996cthjhZGYcAk97iJ 1997-07-27T16:15:00Z 9876 OK
X1 753-22 YJwokU0Dq6xiydkf3EDyxl 1997-07-27T16:15:00Z 9876 OK
X1 753-36 OZHOMirRKjA3LcXTbPJL31 1997-07-27T16:15:00Z 9876 OK
X1 753-34 LvMgT6ed1b1e3uwasGi48G 1997-07-27T16:15:00Z 9877 OK
X1 753-35 VJk4x8sTG1BJTnZYvgu6px 1997-07-27T16:15:00Z 9876 OK
X1 663-27 mkZXgTHKBjmAplrDeoQZXo 1997-07-27T16:15:00Z 9875 ERR
X1 f1K1PzQ9sp2QAv1AX0Zix4 1997-07-27T16:27:00Z 9875 ERR
DONE 69 3QXFXKQAFRSZXJLJ6JZ9NWMXR00B1V1J1FUMBQAA9DQSRCTZF8JXAWWSGHSDIPQ9
Thanks!
PS: Since I am not sure you will get the same output on your machine, here is how it looks for me when I run it, which is also how I want it:
"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"
One awk idea:
awk '
BEGIN { FS="\t"
        OFS="\",\""    # define output field delimiter as <doublequote> <comma> <doublequote>
        # print header
        print "\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven\""
      }
FNR>1 { if (prev) print prev
        split($2,a,"-")
        # reformat current line and save in variable "prev", to be printed on next pass; add <doublequote> on ends
        prev= "\"" $1 OFS a[1] OFS a[2] OFS $3 OFS $4 OFS $5 OFS $6 "\""
      }
' input.dat
This generates:
"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"
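The `prev` variable is what drops the last line: each line is held back one step and printed only once a following line has arrived, so the final line is never printed. Below is a minimal sketch of just that technique, stripped of the CSV formatting. The function name `drop_first_last` is made up for illustration, and the separate `have` flag is an addition (not part of the answer above) so that a buffered line that is empty or `0` is not mistaken for "nothing buffered yet".

```shell
# Hypothetical helper: print stdin without its first and last line.
drop_first_last() {
    awk '
        NR > 1 {
            if (have) print prev   # a later line exists, so prev was not the last line
            prev = $0              # hold the current line back for one step
            have = 1               # flag instead of testing prev, which may be "" or "0"
        }
    '
}

# Demo: only row1 and row2 are printed.
printf 'START\nrow1\nrow2\nDONE\n' | drop_first_last
```

The benefit over `sed '1d;$d'` is that everything stays in one awk process, and the number of lines does not need to be known in advance.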
Given:
sed -E 's/\t/\\t/g' file
START\t9\t1997-07-27T13:37:01Z
X1\t24087-27\tAxgma8PYjc1yRJlUr41688\t1997-07-27T13:09:00Z\t9876\tOK
X1\t642-68\t6nwPtLQTqAAKufH3ejoEeg\t1997-07-27T14:31:00Z\t9876\tOK
X1\t642-31\tqfKH99UnxZTcp2AN8NNB21\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-24\tPouJBByqUJkqhKHBynUesD\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-30\tJ7t2sJKKtcxWJr18I84A46\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-29\tg7hPkNpUywvk6FvGqgpHsx\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-26\tW2KM24xvmy0Q8cLV950tXq\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-25\tdqu8jB5tUthIKevNAQXgld\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-32\tGh0kZkIJr8j6FSYljbpyyy\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-23\tJvl8LMh6SDHfgvLfJIHi5l\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-28\tIZ83996cthjhZGYcAk97iJ\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-22\tYJwokU0Dq6xiydkf3EDyxl\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-36\tOZHOMirRKjA3LcXTbPJL31\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-34\tLvMgT6ed1b1e3uwasGi48G\t1997-07-27T16:15:00Z\t9877\tOK
X1\t753-35\tVJk4x8sTG1BJTnZYvgu6px\t1997-07-27T16:15:00Z\t9876\tOK
X1\t663-27\tmkZXgTHKBjmAplrDeoQZXo\t1997-07-27T16:15:00Z\t9875\tERR
X1\t\tf1K1PzQ9sp2QAv1AX0Zix4\t1997-07-27T16:27:00Z\t9875\tERR
DONE\t69\t3QXFXKQAFRSZXJLJ6JZ9NWMXR00B1V1J1FUMBQAA9DQSRCTZF8JXAWWSGHSDIPQ9
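Using a proper CSV parser for this kind of problem is a good idea.
Ruby is ubiquitous, and its distribution includes a very lightweight but capable CSV parser.
Here is a Ruby version: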
ruby -r csv -e '
data=CSV.parse($<.read, **{:col_sep=>"\t"})
d2=CSV::Table.new([], headers:["field_one","field_two","field_three","field_four","field_five","field_six","field_seven"])
data[1...-1].each { |r|
r_=[]
r.each_with_index { |e,i|
if i == 1
e && e[/-/] ? (r_.concat e.split(/-/,2)) : (r_.concat ["",""])
else
r_ << e
end
}
d2 << r_ }
puts d2.to_csv(**{:force_quotes=>true})
' file
Prints:
"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"
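One reason a real CSV writer is attractive is quoting: if a field ever contains a double quote, simply wrapping it in quotes produces an invalid record. For readers who prefer to stay in awk, here is a rough sketch of the usual convention of doubling embedded quotes (as described in RFC 4180). The names `tsv_to_csv` and `csvq` are invented for this example, and it does not attempt to handle fields that contain newlines.

```shell
# Hypothetical helper: convert tab-separated stdin to fully quoted CSV.
tsv_to_csv() {
    awk -F'\t' '
        # Quote one field, doubling any embedded double quotes.
        function csvq(s) {
            gsub(/"/, "\"\"", s)
            return "\"" s "\""
        }
        {
            line = csvq($1)
            for (i = 2; i <= NF; i++) line = line "," csvq($i)
            print line
        }
    '
}

# Demo: the embedded quotes around hi come out doubled.
printf 'X1\tsay "hi"\tOK\n' | tsv_to_csv
```

The sample data in the question contains no quotes, so this makes no difference there. It only matters if the input format is not fully under your control.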
I would improve this part of the code
awk -F"\t" '{split($2,a,/-/); print $1","a[1]","a[2]","$3","$4","$5","$6}' | sed -e "s/,/\",\"/g" | sed 's/^/"/;s/$/"/' | sed -e $'1i\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
in the following way. Step 1: print "," instead of , so that the sed which converts one into the other is no longer needed, i.e.
awk -F"\t" '{split($2,a,/-/); print $1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6}' | sed 's/^/"/;s/$/"/' | sed -e $'1i\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
Step 2: add the leading " and the trailing " in print, i.e.
awk -F"\t" '{split($2,a,/-/); print "\""$1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6"\""}' | sed -e $'1i\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
Step 3: use BEGIN to print the header, i.e.
awk -F"\t" 'BEGIN{print "\"field_one\",\"field_two\",\"field_three\",\"field_four\",\"field_five\",\"field_six\",\"field_seven\""}{split($2,a,/-/); print "\""$1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6"\""}'
(tested in gawk 4.2.1)
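The chained `"\",\""` concatenations in the last command are easy to get wrong. One possible way to make the same step easier to read is to let `printf` carry the quoting in a single format string. This is only a sketch: the function name `rows_to_csv` is invented here, and it still relies on `sed '1d;$d'` to drop the first and last lines, as in the question.

```shell
# Hypothetical helper: quoted CSV rows plus header, using printf for the quoting.
rows_to_csv() {
    sed '1d;$d' "$@" |
    awk -F'\t' '
        BEGIN {
            # Header: the same seven names used in the question.
            print "\"field_one\",\"field_two\",\"field_three\",\"field_four\",\"field_five\",\"field_six\",\"field_seven\""
        }
        {
            split($2, a, "-")   # a[1] and a[2] are empty when $2 is empty
            printf "\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n",
                   $1, a[1], a[2], $3, $4, $5, $6
        }
    '
}

# Demo with two data rows between the START and DONE marker lines.
printf 'START\t9\tx\nX1\t642-68\tabc\t1997-07-27T14:31:00Z\t9876\tOK\nX1\t\tdef\t1997-07-27T16:27:00Z\t9875\tERR\nDONE\t69\ty\n' | rows_to_csv
```

Because `split` clears the array before filling it, a row with an empty second field produces two empty columns rather than reusing values from the previous row.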
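Not as elegant a solution as I had hoped for, but it gets the job done.
Instead of hard-coding verbal names for the fields, it simply computes the required header row dynamically from the actual input, which also accounts for the expected split of field 2.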
gnice gcat sample.txt \
\
| mawk2 'function trim(_) { return \
\
substr("",gsub("^[,]*|[,]*$","",_))_
} BEGIN { FS = "[ ]+"
OFS = ""
} NR==2 {
for(_=NF+!_;_;_--) {
___=(_)(OFS)___
}
printf("%*s\n",gsub("[0-9]+[^0-9]+",\
"field_&",___)~"",trim(___))
} !/^(START|DONE)/ {
printf("%.0s%s\n",$1=$(($0=\
$(sub("[-]"," ",$2)<""))~""),$0) } ' | lgp3 3
"field_1","field_2","field_3","field_4","field_5","field_6","field_7"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"
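The command above depends on tools that may not be available on every system (`mawk2`, `gcat`, `lgp3`). Here is a more conventional sketch of the same idea in plain awk, where the header is generated from the number of output columns instead of being hard-coded. The function name `auto_header_csv` is invented for this example. It assumes tab-separated input with `START` and `DONE` marker lines, and it always splits field 2 into two columns so that an empty field 2 yields two empty columns.

```shell
# Hypothetical helper: quoted CSV with an auto-numbered header row.
auto_header_csv() {
    awk -F'\t' '
        /^(START|DONE)/ { next }          # skip the marker lines
        {
            split($2, a, "-")             # field 2 always becomes two columns
            row = "\"" $1 "\",\"" a[1] "\",\"" a[2] "\""
            for (i = 3; i <= NF; i++) row = row ",\"" $i "\""
            if (!printed) {               # build the header from the first data row
                hdr = ""
                for (i = 1; i <= NF + 1; i++)
                    hdr = hdr (i > 1 ? "," : "") "\"field_" i "\""
                print hdr
                printed = 1
            }
            print row
        }
    ' "$@"
}

# Demo with two data rows between the START and DONE marker lines.
printf 'START\t9\tx\nX1\t642-68\tabc\t1997-07-27T14:31:00Z\t9876\tOK\nX1\t\tdef\t1997-07-27T16:27:00Z\t9875\tERR\nDONE\t69\ty\n' | auto_header_csv
```

The header has `NF + 1` columns because splitting field 2 adds one column to each row.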