Text processing: ignore the second occurrence of the underscore

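Everything from the second occurrence of the underscore onward should be ignored. The result should then be sorted, and duplicates need to be removed.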

I tried: awk -F_ '{print }' file1 >> file 2; sort file1 | uniq

******FROM********

GGGGGGG             DDDDD   --> header
XYSER_YURTZ     SUMOT_2_058A     
XYSER_YURTZ     SUMOT_2_058B    
XYSER_YURTZ     HJRIT_6_51A     
XYSER_YURTZ     HJRIT_6_51B     
XYSER_YURTZ     HJRIT_6_51C    
XYSER_YURTZ     HJRIT_6_51D    
XYSER_YURTZ     HJRIT_6_51E    
XYSER_YURTZ     HJRIT_6_51F    
XYSER_YURTZ     HJRIT_6_520    
XYSER_YURTZ     HJRIT_6_521    
XYSER_GFRE      SUMOT_2_16C3    
XYSER_GFRE      SUMOT_2_16C4    
XYSER_GFRE      SUMOT_2_16C5    
XYSER_GFRE      SUMOT_2_16C6  
XYSER_GFRE      SUMOT_2_16C7  
XYSER_GFRE      SUMOT_2_16C8  
XYSER_GFRE      SUMOT_2_16C9  
XYSER_GFRE      SUMOT_2_16CA  
XYSER_GFRE      SUMOT_2_16CB  
XYSER_GFRE      SUMOT_2_16CC   
XYSER_GFRE      SUMOT_2_16CD  
XYSER_GFRE      SUMOT_2_16CE   
XYSER_GFRE      SUMOT_2_16CF  
XYSER_GFRE      SUMOT_2_16D0  
XYSER_GFRE      SUMOT_2_16D1  
XYSER_GFRE      SUMOT_2_16D2  
XYSER_GFRE      SUMOT_2_16D3  
XYSER_GFRE      SUMOT_2_16D4  
XYSER_GFRE      HJRIT_6_12E1    
XYSER_GFRE      HJRIT_6_12E2    
XYSER_GFRE      HJRIT_6_12E3    
XYSER_GFRE      HJRIT_6_12E4    
XYSER_GFRE      HJRIT_6_12E5   
XYSER_GFRE      HJRIT_6_12E6   
XYSER_GFRE      HJRIT_6_12E7   
XYSER_GFRE      HJRIT_6_12E8   
XYSER_GFRE      HJRIT_6_12E9   
XYSER_GFRE      HJRIT_6_12EA   
XYSER_GFRE      HJRIT_6_12EB   
XYSER_GFRE      HJRIT_6_12EC   
XYSER_GFRE      HJRIT_6_12ED   
XYSER_ALY1      XYSER_ALY1_0000   
XYSER_ALY       SUMOT_2_0497   
XYSER_ALY       SUMOT_2_0498   
XYSER_BAP01     SUMOT_2_020E 

**************OUTPUT1**************

GGGGGGG DDDDD   
XYSER_YURTZ SUMOT_2   
XYSER_YURTZ HJRIT_6   
XYSER_GFRE SUMOT_2   
XYSER_GFRE HJRIT_6   
XYSER_ALY1 XYSER_ALY1   
XYSER_ALY SUMOT_2       
XYSER_BAP01 SUMOT_2   
XYSER_BAP02 SUMOT_2   

**************OUTPUT2**************

DDDDD GGGGGGG   
SUMOT_2 XYSER_YURTZ  
SUMOT_2 XYSER_GFRE  
SUMOT_2 XYSER_ALY  
SUMOT_2 XYSER_BAP01  
SUMOT_2 XYSER_BAP02  
HJRIT_6 XYSER_YURTZ  
HJRIT_6 XYSER_GFRE  
XYSER_ALY1 XYSER_ALY1  

Based on your example input, you can use

sed 's/_[^_]*$//' inputfile|sort|uniq

This removes the last underscore on each line and all characters after it.
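To illustrate what the substitution does, here is a minimal sketch run on three lines taken from the example input. The function name strip_last_field is only a label chosen for this illustration.

```shell
# Hypothetical helper: delete the last underscore on each line
# and everything that follows it.
strip_last_field() {
    sed 's/_[^_]*$//'
}

# Three lines from the example input; the first two collapse into one.
printf '%s\n' \
    'XYSER_YURTZ     SUMOT_2_058A' \
    'XYSER_YURTZ     SUMOT_2_058B' \
    'XYSER_ALY1      XYSER_ALY1_0000' |
    strip_last_field | sort | uniq
```

Note that the spacing between the two columns is left untouched, so the output keeps the original run of spaces rather than the single space shown in OUTPUT1.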

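Note: the sort command may place the header between other lines, because it sorts the whole input alphanumerically. In your example this is not a problem, because the header line GGGGGGG... sorts before XYSER_....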
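If the header could sort somewhere else, one possible workaround is to print the first line unchanged and sort only the rest. This is a sketch that assumes the header is exactly the first line of the file; the function name dedupe_keep_header and the ZZZZZZZ header in the sample are made up for the illustration.

```shell
# Hypothetical helper: print the first line as-is, then the
# processed, sorted and de-duplicated remainder of the file.
dedupe_keep_header() {
    head -n 1 "$1"
    tail -n +2 "$1" | sed 's/_[^_]*$//' | sort -u
}

# Small sample in a temporary file, standing in for inputfile.
# The ZZZZZZZ header is invented: it would sort last if it were
# passed through sort together with the data.
sample=$(mktemp)
printf '%s\n' \
    'ZZZZZZZ         DDDDD' \
    'XYSER_GFRE      SUMOT_2_16C3' \
    'XYSER_GFRE      SUMOT_2_16C4' \
    'XYSER_ALY       SUMOT_2_0497' > "$sample"
dedupe_keep_header "$sample"
rm -f "$sample"
```

Here sort -u does the same job as sort | uniq in a single step.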

If you know that similar lines are already grouped together in your input file, you can omit the sort and use

sed 's/_[^_]*$//' inputfile|uniq
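
The question also shows a second output with the two columns swapped. A possible sketch for that, assuming whitespace-separated columns and that only the first two fields matter, is to let awk trim the second field and print the fields in reverse order. The function name swap_and_dedupe is made up for this illustration.

```shell
# Hypothetical helper: trim the second column at its last underscore,
# print the columns in swapped order, then sort and de-duplicate.
swap_and_dedupe() {
    awk '{ sub(/_[^_]*$/, "", $2); print $2, $1 }' | sort -u
}

# A few lines from the example input.
printf '%s\n' \
    'XYSER_GFRE      SUMOT_2_16C3' \
    'XYSER_GFRE      SUMOT_2_16C4' \
    'XYSER_GFRE      HJRIT_6_12E1' \
    'XYSER_ALY1      XYSER_ALY1_0000' |
    swap_and_dedupe
```

Because awk rebuilds each line from its fields, the columns come out separated by a single space. The lines are ordered alphabetically by sort, which is not the same order as in the OUTPUT2 sample of the question.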