UNIX: How to cut columns from the right when not all fields are the same length
I have a list of data, and I need to remove certain characters from certain columns.
Here is the list:
JCG2380 GREEN, JULIE C JR-II BISS CPSC BS INFO TECH XXX/XXX-9445
JAG1936 GREEN, JOE A. SO-I BISS CPSC BS INFO TECH XXX/XXX-7993
ACG4636 GREEN, ADAM C. JR-II BISS CPSC BS COMP SCI XXX/XXX-0437
SPG1696 GREEN, SEAN P. JR-I BISS CPSC BS COMP SCI XXX/XXX-2398
SEG8835 GREEN, SHAWN E. FR-II BISS CPSC BS COMP SCI XXX/XXX-7149
MCGo599 GREEN, MICHAEL C. JR-I BISS CPSC BS COMP SCI XXX/XXX-OOOO
GJG1887 GREEN, GREGORY J. SO-II BISS CPSC BS INFO TECH XXX/XXX-4354
NGG5479 GREEN, NICHOLAS G JR-I BISS CPSC BS INFO TECH XXX/XXX-8268
ZTG7190 GREEN, ZACHARY T. FR-II BISS CPSC BS INFO TECH XXX/XXX-1298
AXG9097 GREEN, ALEXANDER SO-I BISS CPSC BS INFO TECH XXX/XXX-0313
RJG6624 GREEN, ROBERT J. SO-II BISS CPSC BS COMP SCI XXX/XXX-ZOZI
MWG1990 GREEN, MATTHEW W SO-II BISS CPSC BS INFO TECH XXX/XXX-0581
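The problem is that not all fields are the same size. Notice that Alexander Green (third from the bottom) has no middle initial. That prevents me from using awk uniformly on every column. My idea is to cut everything from the right side of the file, so that the field separators don't throw everything off.
So how do I use the cut command to start at the rightmost column and count back 7 columns?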
You can use cut, because your data has fixed-width fields.
This is what I got with the OCR'd text:
$ cut -c 33-51,73-77 input
JR-II BISS CPSC BS 9445
SO-I BISS CPSC BS 7993
JR-II BISS CPSC BS 0437
JR-I BISS CPSC BS 2398
FR-II BISS CPSC BS 7149
JR-I BISS CPSC BS OOOO
SO-II BISS CPSC BS 4354
JR-I BISS CPSC BS 8268
FR-II BISS CPSC BS 1298
SO-I BISS CPSC BS 0313
SO-II BISS CPSC BS ZOZI
SO-II BISS CPSC BS 0581
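The ranges passed to cut -c depend on where each column starts in the fixed-width file. One way to look those offsets up instead of counting by hand is awk's index() function. This is only a sketch: the helper name col_of is made up for illustration, and it assumes the token appears on the line (index() returns 0 otherwise).

```shell
# col_of TOKEN: for each line on stdin, print the 1-based column
# at which TOKEN first appears (0 if it does not appear).
col_of() {
    awk -v tok="$1" '{ print index($0, tok) }'
}

# Example: where does "BISS" start on a sample line?
printf '%s\n' 'JCG2380 GREEN, JULIE C JR-II BISS CPSC' | col_of BISS
```

Running it against a few lines of the real file should show the same number on every line if the column really is fixed-width.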
And to match the requirement you wrote in the comments:
Exactly what I'm trying to do is get the first character out of the
columns that start (from the top entry) with JR, BISS, CPSC, INFO.
Then I need the last 4 digits from the phone numbers on the right.
$ cut -c 32-33,38-39,43-44,48-49,64-64,73-77 input
J B C B 9445
S B C B 7993
J B C B 0437
J B C B 2398
F B C B 7149
J B C B OOOO
S B C B 4354
J B C B 8268
F B C B 1298
S B C B 0313
S B C B ZOZI
S B C B 0581
You will need to adjust the ranges for your actual data.
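If counting from the right is what you really want, a common alternative is to reverse each line with rev, cut from what is now the left, and reverse back. The sketch below assumes there is no trailing whitespace after the phone number; the helper names last_chars and last_fields are made up for illustration.

```shell
# last_chars N: print the last N characters of every line on stdin.
# After rev, the last N characters are the first N.
last_chars() {
    rev | cut -c "1-$1" | rev
}

# last_fields N: print the last N space-separated fields of every line.
# tr -s squeezes runs of spaces so that cut sees one delimiter per gap.
last_fields() {
    tr -s ' ' | rev | cut -d ' ' -f "1-$1" | rev
}

line='AXG9097 GREEN, ALEXANDER SO-I BISS CPSC BS INFO TECH XXX/XXX-0313'
printf '%s\n' "$line" | last_chars 4
printf '%s\n' "$line" | last_fields 7
```

Note that rev is not specified by POSIX, although it ships with util-linux and the BSDs. Because everything is counted from the end of the line, the missing middle initial does not shift the result.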
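The following meets the requirements as I understand them, except that I use a tab as the output field separator to make it easier for you to adjust: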
awk 'BEGIN {OFS="\t"} {
  # Each line is assumed to have a variable number
  # of name fields plus 8 other tokens:
  nnames = NF-8;
  # from the right:
  tel=$NF;
  subject2=$(NF-1);
  subject1=$(NF-2);
  bs=$(NF-3); cpsc=$(NF-4); biss=$(NF-5); data=$(NF-6);
  # The name starts at field 2; append the remaining name fields
  name=$2;
  for (i=2; i<=nnames;i++) {name=name " " $(i+1)}
  # Adjustments: drop the first character of each column,
  # and keep only what follows the "-" in the phone number
  data=substr(data,2); biss=substr(biss,2); cpsc=substr(cpsc,2);
  subject1=substr(subject1,2)
  sub( /[^-]*-/,"", tel);
  print $1, name, data, biss, cpsc, bs, subject1 " " subject2, tel;
}'
Output:
JCG2380 GREEN, JULIE C R-II ISS PSC BS NFO TECH 9445
JAG1936 GREEN, JOE A. O-I ISS PSC BS NFO TECH 7993
ACG4636 GREEN, ADAM C. R-II ISS PSC BS OMP SCI 0437
SPG1696 GREEN, SEAN P. R-I ISS PSC BS OMP SCI 2398
SEG8835 GREEN, SHAWN E. R-II ISS PSC BS OMP SCI 7149
MCGo599 GREEN, MICHAEL C. R-I ISS PSC BS OMP SCI OOOO
GJG1887 GREEN, GREGORY J. O-II ISS PSC BS NFO TECH 4354
NGG5479 GREEN, NICHOLAS G R-I ISS PSC BS NFO TECH 8268
ZTG7190 GREEN, ZACHARY T. R-II ISS PSC BS NFO TECH 1298
AXG9097 GREEN, ALEXANDER O-I ISS PSC BS NFO TECH 0313
RJG6624 GREEN, ROBERT J. O-II ISS PSC BS OMP SCI ZOZI
MWG1990 GREEN, MATTHEW W O-II ISS PSC BS NFO TECH 0581
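The quoted comment can also be read as asking to keep the first character of each of those columns rather than to drop it. Here is a minimal awk sketch of that reading, again addressing fields relative to NF so that the variable-length name does not matter. It assumes every line ends with the same seven tokens as the sample (class, BISS, CPSC, BS, two subject words, phone number); the helper name initials is made up for illustration.

```shell
# initials [FILE...]: print the first character of the class, BISS, CPSC
# and first subject-word columns, then the last 4 characters of the phone.
initials() {
    awk '{
        # Fields counted from the right: $NF is the phone number,
        # $(NF-2) the first subject word, $(NF-6) the class column.
        print substr($(NF-6), 1, 1), substr($(NF-5), 1, 1),
              substr($(NF-4), 1, 1), substr($(NF-2), 1, 1),
              substr($NF, length($NF) - 3)
    }' "$@"
}

printf '%s\n' \
    'JCG2380 GREEN, JULIE C JR-II BISS CPSC BS INFO TECH XXX/XXX-9445' \
    'AXG9097 GREEN, ALEXANDER SO-I BISS CPSC BS INFO TECH XXX/XXX-0313' |
    initials
```

With no arguments the function reads standard input; given file names, it passes them straight to awk.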