Make a file look not messed up
I have a file that looks messy:
contig_1    bin.0013    Rhizobium flavum (taxid 1335061)
contig_2                Alphaproteobacteria (taxid 28211)
contig_3    bin.009
contig_4    bin.008     unclassified (taxid 0)
contig_5    bin.001     Fluviicoccus keumensis (taxid 1435465)
contig_12   bin.003
I want it to look right, with tab-separated columns and zeros in the empty fields:
contig_1    bin.0013    Rhizobium flavum (taxid 1335061)
contig_2    0           Alphaproteobacteria (taxid 28211)
contig_3    bin.009     0
contig_4    bin.008     unclassified (taxid 0)
contig_5    bin.001     Fluviicoccus keumensis (taxid 1435465)
contig_12   bin.003     0
If I use sed like this: sed 's/ /,/g' filename
commas get inserted everywhere, not just between columns 1-2 and 2-3.
If awk is an option for you, would you please try the following:
awk -v OFS="\t" '
NR==FNR {
    # in the 1st pass, detect the starting positions of the 2nd field and the 3rd
    sub(" +$", "")      # it avoids misdetection due to extra trailing blanks
    if (match($0, "[^[:blank:]]+[[:blank:]]+")) {
        # the match starts at column 1, so RLENGTH is the position of the last blank before the 2nd field
        if (col2 == 0 || RLENGTH < col2) col2 = RLENGTH + 1
        if (match($0, "[^[:blank:]]+[[:blank:]]+[^[:blank:]]+[[:blank:]]+")) {
            # here RLENGTH is the position of the last blank before the 3rd field
            if (col3 == 0 || RLENGTH < col3) col3 = RLENGTH + 1
        }
    }
    next
}
{
    # in the 2nd pass, extract the substrings at the fixed positions and reformat them
    # by removing extra spaces and putting "0" if the field is empty
    c1 = substr($0, 1, col2 - 1); sub(" +$", "", c1); if (c1 == "") c1 = "0"
    c2 = substr($0, col2, col3 - col2); sub(" +$", "", c2); if (c2 == "") c2 = "0"
    # trim trailing blanks first so that a blank-only 3rd field also becomes "0"
    c3 = substr($0, col3); sub(" +$", "", c3); gsub(" +", " ", c3); if (c3 == "") c3 = "0"
    # print c1, c2, c3    # use this for the tab-separated output
    printf("%-12s%-12s%-s\n", c1, c2, c3)
}' file file
Output:
contig_1    bin.0013    Rhizobium flavum (taxid 1335061)
contig_2    0           Alphaproteobacteria (taxid 28211)
contig_3    bin.009     0
contig_4    bin.008     unclassified (taxid 0)
contig_5    bin.001     Fluviicoccus keumensis (taxid 1435465)
contig_12   bin.003     0
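- The procedure consists of two passes. In the 1st pass, it detects the starting positions of the fields.
- In the 2nd pass, it cuts out the individual fields using the positions calculated in the 1st pass.
- I have chosen printf to align the output visually. You can switch to tab-separated values, depending on your preference, by using the commented-out print line instead.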
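For the tab-separated output the question asks for, the same two passes could be wrapped in a small shell function, roughly as sketched below. The function name to_tsv is made up for illustration. The sketch assumes that the first field always starts at column 1, that at least one line contains all three fields, and that the columns are padded with spaces rather than tabs. Anchoring the patterns with ^ is a small addition of this sketch and is not part of the code above.

```shell
# Hypothetical helper (name is illustrative): print a space-padded,
# three-column file as tab-separated values, with "0" in empty fields.
to_tsv() {
  awk -v OFS='\t' '
    NR == FNR {
      # 1st pass: find the smallest starting position of fields 2 and 3
      sub(/ +$/, "")                      # ignore trailing blanks
      if (match($0, /^[^[:blank:]]+[[:blank:]]+/)) {
        if (col2 == 0 || RLENGTH + 1 < col2) col2 = RLENGTH + 1
        if (match($0, /^[^[:blank:]]+[[:blank:]]+[^[:blank:]]+[[:blank:]]+/)) {
          if (col3 == 0 || RLENGTH + 1 < col3) col3 = RLENGTH + 1
        }
      }
      next
    }
    {
      # 2nd pass: cut at the detected positions, trim, fill empty fields with 0
      c1 = substr($0, 1, col2 - 1);       sub(/ +$/, "", c1)
      c2 = substr($0, col2, col3 - col2); sub(/ +$/, "", c2)
      c3 = substr($0, col3);              sub(/ +$/, "", c3); gsub(/ +/, " ", c3)
      if (c1 == "") c1 = "0"
      if (c2 == "") c2 = "0"
      if (c3 == "") c3 = "0"
      print c1, c2, c3                    # OFS is a tab
    }' "$1" "$1"                          # the file is read twice, once per pass
}
```

It would be called as to_tsv file > file.tsv. The input has to be a regular file rather than a pipe, because awk reads it twice.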