Can we use AWK and gsub() to process data with multiple colons ":"? How?
Here is a sample of the data:
Col_01:14 .... Col_20:25 Col_21:23432 Col_22:639142
Col_01:8 .... Col_20:25 Col_22:25134 Col_23:243344
Col_01:17 .... Col_21:75 Col_23:79876 Col_25:634534 Col_22:5 Col_24:73453
Col_01:19 .... Col_20:25 Col_21:32425 Col_23:989423
Col_01:12 .... Col_20:25 Col_21:23424 Col_22:342421 Col_23:7 Col_24:13424 Col_25:67
Col_01:3 .... Col_20:95 Col_21:32121 Col_25:111231
As you can see, some of the columns are out of order...
Now, I think the correct way to import this file into a dataframe is to preprocess the data so that you can output a dataframe with NaN values, e.g.:
Col_01 .... Col_20 Col_21 Col_22 Col_23 Col_24 Col_25
8      .... 25     NaN    25134  243344 NaN    NaN
17     .... NaN    75     5      79876  73453  634534
19     .... 25     32425  NaN    989423 NaN    NaN
12     .... 25     23424  342421 7      13424  67
3      .... 95     32121  NaN    NaN    NaN    111231
@JamesBrown showed a solution here:
Using the following awk script:
BEGIN {
    PROCINFO["sorted_in"]="@ind_str_asc"     # traversal order for for(i in a)
}
NR==1 {       # the header cols are at the beginning of the data file
              # FORGET THIS: header cols from another file: replace NR==1 with NR==FNR and see * below
    split($0,a," ")                          # mkheader a[1]=first_col ...
    for(i in a) {                            # replace with a[first_col]="" ...
        a[a[i]]                              # ... index a[] by column name
        printf "%6s%s", a[i], OFS            # output the header
        delete a[i]                          # remove a[1], a[2], ...
    }
    # next                                   # FORGET THIS * next here if cols from another file UNTESTED
}
{
    gsub(/: /,"=")                           # replace key-value separator ": " with "="
    split($0,b,FS)                           # split record on FS
    for(i in b) {
        split(b[i],c,"=")                    # split key=value to c[1]=key, c[2]=value
        b[c[1]]=c[2]                         # b[key]=value
    }
    for(i in a)                              # go thru headers in a[] and printf from b[]
        printf "%6s%s", (i in b?b[i]:"NaN"), OFS; print ""
}
And putting the headers in a text file cols.txt:
Col_01 Col_20 Col_21 Col_22 Col_23 Col_25
My question now is: how can we use awk if our data is not column: value but column: value1: value2: value3? We would like the database entry to be value1: value2: value3.
Here is the new data:
Col_01:14:a:47 .... Col_20:25:i:z Col_21:23432:6:b Col_22:639142:4:x
Col_01:8:z .... Col_20:25:i:4 Col_22:25134:u:0 Col_23:243344:5:6
Col_01:17:7:z .... Col_21:75:u:q Col_23:79876:u:0 Col_25:634534:8:1
We still provide the columns in cols.txt in advance.
How would we create a similar database structure? Can gsub() be limited to the first value before the ":", which is the same as the header?
Edit: this does not have to be based on awk. Any language will do.
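In other words, the key step is to split each field only at its first ":" and keep everything after it intact. A minimal sketch of just that step (illustrative only, plain awk, no gsub() needed):

{
    for (i = 1; i <= NF; i++) {
        p = index($i, ":")                # position of the first ":" in the field
        if (p == 0) continue              # skip filler fields such as "...."
        name  = substr($i, 1, p - 1)      # e.g. "Col_01"
        value = substr($i, p + 1)         # e.g. "14:a:47", kept intact
        printf "%s -> %s\n", name, value
    }
}

The answers below build the full NaN table around essentially this first-colon split.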
Here's another option...
$ awk -v OFS='\t' '{for(i=1;i<NF;i+=2)               # iterate over name: value pairs
                      {c=$i;                         # copy name in c to modify
                       sub(/:/,"",c);                # remove colon
                       a[NR,c]=$(i+1);               # collect data by row number, name
                       cols[c]}}                     # save name
                END{n=asorti(cols,icols);            # sort names
                    for(j=1;j<=n;j++) printf "%s", icols[j] OFS;  # print header
                    print "";
                    for(i=1;i<=NR;i++)               # print data
                      {for(j=1;j<=n;j++)
                         {v=a[i,icols[j]];
                          printf "%s", (v?v:"NaN") OFS}  # replace missing data with NaN
                       print ""}}' file | column -t      # pipe to column for pretty print
Col_01   Col_20  Col_21     Col_22      Col_23      Col_25
14:a:47  25:i:z  23432:6:b  639142:4:x  NaN         NaN
8:z      25:i:4  NaN        25134:u:0   243344:5:6  NaN
17:7:z   NaN     75:u:q     NaN         79876:u:0   634534:8:1
Along the same lines as karakfa's answer: if the column name is not separated from the value by whitespace (e.g. if you have Col_01:14:a:47), you can do this instead (using the GNU awk extended match() function):
{
    for (i=1; i<=NF; i++) {
        match($i, /^([^:]+):(.*)/, m)
        a[NR,m[1]] = m[2]
        cols[m[1]]
    }
}
The END block is the same.
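For completeness, here is an untested sketch of the two pieces joined into one runnable command (my assembly, assuming GNU awk for the three-argument match() and asorti(); a guard is added so fields without a ":", such as the literal "....", are skipped):

gawk -v OFS='\t' '
{
    for (i = 1; i <= NF; i++)
        if (match($i, /^([^:]+):(.*)/, m)) {   # m[1]=column name, m[2]=value1:value2:...
            a[NR, m[1]] = m[2]                 # store by row number and column name
            cols[m[1]]                         # remember the column name
        }
}
END {
    n = asorti(cols, icols)                    # sort the column names
    for (j = 1; j <= n; j++) printf "%s", icols[j] OFS   # print header
    print ""
    for (i = 1; i <= NR; i++) {                # one output row per input row
        for (j = 1; j <= n; j++) {
            v = a[i, icols[j]]
            printf "%s", (v ? v : "NaN") OFS   # NaN for missing cells
        }
        print ""
    }
}' file | column -t

On the new sample data this should reproduce the same table shown above.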
Using TXR's Lisp macro implementation of the Awk paradigm:
(awk (:set ft #/-?\d+/) ;; ft is "field tokenize" (no counterpart in Awk)
(:let (tab (hash :equal-based)) (max-col 1) (width 8))
((ff (mapcar toint) (tuples 2)) ;; filter fields to int and shore up into pairs
(set max-col (max max-col [find-max [mapcar first f]]))
(mapdo (ado set [tab ^(,nr ,@1)] @2) f)) ;; stuff data into table
(:end (let ((headings (mapcar (opip (format nil "Col~,02a")
`@{@1 width}`)
(range 1 max-col))))
(put-line `@{headings " "}`))
(each ((row (range 1 nr)))
(let ((cols (mapcar (opip (or [tab ^(,row ,@1)] "NaN")
`@{@1 width}`)
(range 1 max-col))))
(put-line `@{cols " "}`)))))
Smaller sample data:
Col_01: 14 Col_04: 25 Col_06: 23432 Col_07: 639142
Col_02: 8 Col_03: 25 Col_05: 25134 Col_06: 243344
Col_01: 17
Col_06: 19 Col_07: 32425
Run:
$ txr reformat.tl data-small
Col01 Col02 Col03 Col04 Col05 Col06 Col07
14 NaN NaN 25 NaN 23432 639142
NaN 8 25 NaN 25134 243344 NaN
17 NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN 19 32425
P.S. opip is a macro which bootstraps from the op partial-application macro; opip implicitly distributes op into its argument expressions and then chains the resulting functions into a single functional pipeline: hence "op-pipe". In each pipeline element, that element's own implicit numbered arguments @1, @2, ... can be referenced; if they are absent, the partially applied function implicitly receives the pipeline object as its rightmost argument.
The ^(,row ,@1) syntax is TXR Lisp's backquote. The backtick that mainstream Lisp dialects use for backquoting is already taken for string quasi-literals. The expression is equivalent to (list row @1): it builds a list containing the value of row and the implicit argument @1 of the op/do-generated function. These two-element lists are used as hash keys, simulating a two-dimensional array. For that to work, the hash must be :equal-based: two lists (1 2) that are separate instances rather than the same object are not eql, but they do compare equal under the equal function.
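(As an aside: the gawk solutions above get the same two-dimensional lookup from a[NR, name], which awk implements by joining the subscripts into a single string key with SUBSEP. A small illustration:)

BEGIN {
    a[1, "Col_01"] = 14                   # stored under the single key 1 SUBSEP "Col_01"
    if ((1, "Col_01") in a)               # multi-subscript membership test ...
        print "found via (row, name)"
    if ((1 SUBSEP "Col_01") in a)         # ... is the same as testing the joined string key
        print "found via joined string key"
}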
Just for fun, some unreadable Perl:
perl -aE'%l=%{{@F}};while(($k,$v)=each%l){$c{$k}=1;$a[$.]{$k}=$v}END{$,="\t";say@c=sort keys%c;for$i(1..$.){say map{$a[$i]{$_}//"NaN"}@c}}' input
(Community wiki to hide my shame...)
Saving a few keystrokes:
perl -aE'while(@F){$c{$k=shift@F}=1;$data[$.]{$k}=shift@F}END{$,="\t";say@c=sort keys%c;for$i(1..$.){say map{$data[$i]{$_}//"NaN"}@c}}' input