Text processing: ignore the 2nd occurrence of underscore
Everything from the 2nd occurrence of the underscore onward should be ignored; the result should then be sorted and de-duplicated.
I tried: awk -F_ '{print }' file1 >> file 2; sort file1 | uniq
******FROM********
GGGGGGG DDDDD --> header
XYSER_YURTZ SUMOT_2_058A
XYSER_YURTZ SUMOT_2_058B
XYSER_YURTZ HJRIT_6_51A
XYSER_YURTZ HJRIT_6_51B
XYSER_YURTZ HJRIT_6_51C
XYSER_YURTZ HJRIT_6_51D
XYSER_YURTZ HJRIT_6_51E
XYSER_YURTZ HJRIT_6_51F
XYSER_YURTZ HJRIT_6_520
XYSER_YURTZ HJRIT_6_521
XYSER_GFRE SUMOT_2_16C3
XYSER_GFRE SUMOT_2_16C4
XYSER_GFRE SUMOT_2_16C5
XYSER_GFRE SUMOT_2_16C6
XYSER_GFRE SUMOT_2_16C7
XYSER_GFRE SUMOT_2_16C8
XYSER_GFRE SUMOT_2_16C9
XYSER_GFRE SUMOT_2_16CA
XYSER_GFRE SUMOT_2_16CB
XYSER_GFRE SUMOT_2_16CC
XYSER_GFRE SUMOT_2_16CD
XYSER_GFRE SUMOT_2_16CE
XYSER_GFRE SUMOT_2_16CF
XYSER_GFRE SUMOT_2_16D0
XYSER_GFRE SUMOT_2_16D1
XYSER_GFRE SUMOT_2_16D2
XYSER_GFRE SUMOT_2_16D3
XYSER_GFRE SUMOT_2_16D4
XYSER_GFRE HJRIT_6_12E1
XYSER_GFRE HJRIT_6_12E2
XYSER_GFRE HJRIT_6_12E3
XYSER_GFRE HJRIT_6_12E4
XYSER_GFRE HJRIT_6_12E5
XYSER_GFRE HJRIT_6_12E6
XYSER_GFRE HJRIT_6_12E7
XYSER_GFRE HJRIT_6_12E8
XYSER_GFRE HJRIT_6_12E9
XYSER_GFRE HJRIT_6_12EA
XYSER_GFRE HJRIT_6_12EB
XYSER_GFRE HJRIT_6_12EC
XYSER_GFRE HJRIT_6_12ED
XYSER_ALY1 XYSER_ALY1_0000
XYSER_ALY SUMOT_2_0497
XYSER_ALY SUMOT_2_0498
XYSER_BAP01 SUMOT_2_020E
TO
**************OUTPUT1**************
GGGGGGG DDDDD
XYSER_YURTZ SUMOT_2
XYSER_YURTZ HJRIT_6
XYSER_GFRE SUMOT_2
XYSER_GFRE HJRIT_6
XYSER_ALY1 XYSER_ALY1
XYSER_ALY SUMOT_2
XYSER_BAP01 SUMOT_2
XYSER_BAP02 SUMOT_2
**************OUTPUT2**************
DDDDD GGGGGGG
SUMOT_2 XYSER_YURTZ
SUMOT_2 XYSER_GFRE
SUMOT_2 XYSER_ALY
SUMOT_2 XYSER_BAP01
SUMOT_2 XYSER_BAP02
HJRIT_6 XYSER_YURTZ
HJRIT_6 XYSER_GFRE
XYSER_ALY1 XYSER_ALY1
Based on your example input, you can use
sed 's/_[^_]*$//' inputfile|sort|uniq
This removes the last underscore on each line and all characters after it.
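Because the sed expression works on the whole line, it relies on the last underscore always being in the second column. If a line had no underscore in column 2, the match would start in column 1 instead. A minimal sketch of a stricter variant follows, assuming whitespace-separated columns and assuming that "2nd occurrence" means the second underscore inside column 2. The function name trim_second_column is hypothetical.

```shell
# Hypothetical helper: trims only column 2, keeping its first two
# underscore-separated parts (e.g. SUMOT_2_058A becomes SUMOT_2).
trim_second_column() {
    awk '{
        n = split($2, part, "_")              # number of "_"-separated parts in column 2
        if (n > 2) $2 = part[1] "_" part[2]   # drop the 2nd underscore and the rest
        print
    }'
}

# Small demonstration on lines taken from the question.
printf '%s\n' 'GGGGGGG DDDDD' 'XYSER_YURTZ SUMOT_2_058A' 'XYSER_ALY1 XYSER_ALY1_0000' |
    trim_second_column
```

On the real file this would be used as: trim_second_column < inputfile | sort | uniq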
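Note: the sort command may place the header between other lines, because it sorts the whole data set alphanumerically. In your example this is not a problem, because the header line (GGGGGGG...) sorts before the XYSER_... lines.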
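If the header could sort into the middle of the data, one option is to pass the first line through untouched and sort only the rest. This is a sketch, assuming the header is always line 1 of the input. The function name dedupe_keep_header is hypothetical.

```shell
# Hypothetical helper: reads the table on stdin and keeps line 1 on top.
dedupe_keep_header() {
    sed 's/_[^_]*$//' | {
        IFS= read -r header        # consume only the first line
        printf '%s\n' "$header"
        sort -u                    # sort and de-duplicate the remaining lines
    }
}

# Demonstration with a made-up header that would otherwise sort in the middle.
printf '%s\n' 'NAME ID' 'XYSER_B SUMOT_2_1' 'ABC_A HJRIT_6_1' 'XYSER_B SUMOT_2_2' |
    dedupe_keep_header
```

Here sort -u stands in for sort | uniq. On the real file the call would be: dedupe_keep_header < inputfile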
If you know that similar lines are already grouped together in your input file, you can omit the sort and use
sed 's/_[^_]*$//' inputfile|uniq
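The commands above produce OUTPUT1 only. For OUTPUT2 from the question (columns swapped, rows grouped by the trimmed second column), a possible sketch is shown below. It assumes the header is line 1 and that groups should appear in the order they are first seen, which is how the sample OUTPUT2 appears to be ordered. The function name swap_and_group is hypothetical.

```shell
# Hypothetical helper: trims the suffix, swaps the two columns, removes
# duplicates, and groups rows by the (new) first column in first-seen order.
swap_and_group() {
    sed 's/_[^_]*$//' | awk '
        NR == 1 { print $2, $1; next }                 # header: just swap the columns
        !seen[$2 FS $1]++ {                            # first time this pair is seen
            if (!($2 in group)) order[++n] = $2        # remember group order
            group[$2] = group[$2] $2 " " $1 "\n"       # append row to its group
        }
        END { for (i = 1; i <= n; i++) printf "%s", group[order[i]] }'
}

# Demonstration on a few lines from the question.
printf '%s\n' 'GGGGGGG DDDDD' 'XYSER_YURTZ SUMOT_2_058A' 'XYSER_YURTZ HJRIT_6_51A' \
    'XYSER_GFRE SUMOT_2_16C3' 'XYSER_GFRE HJRIT_6_12E1' | swap_and_group
```

Because the awk step tracks pairs it has already seen, no sort or uniq is needed here, and the input does not have to be pre-grouped.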