Dynamically populate multidimensional awk array
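I'm writing an Awk/Gawk script that parses a file and populates a multidimensional array from each line. The first column is a period-delimited string in which each segment is the array key for the next level down. The second column is the value.
Here's an example of the content being parsed: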
$ echo -e "personal.name.first\t= John\npersonal.name.last\t= Doe\npersonal.other.dob\t= 05/07/87\npersonal.contact.phone\t= 602123456\npersonal.contact.email\t= john.doe@idk\nemployment.jobs.1\t= Company One\nemployment.jobs.2\t= Company Two\nemployment.jobs.3\t= Company Three"
personal.name.first = John
personal.name.last = Doe
personal.other.dob = 05/07/87
personal.contact.phone = 602123456
personal.contact.email = john.doe@idk
employment.jobs.1 = Company One
employment.jobs.2 = Company Two
employment.jobs.3 = Company Three
Once it has been parsed, I want it to have the same structure as:
data["personal"]["name"]["first"] = "John"
data["personal"]["name"]["last"] = "Doe"
data["personal"]["other"]["dob"] = "05/07/87"
data["personal"]["contact"]["phone"] = "602123456"
data["personal"]["contact"]["email"] = "john.doe@idk"
data["employment"]["jobs"]["1"] = "Company One"
data["employment"]["jobs"]["2"] = "Company Two"
data["employment"]["jobs"]["3"] = "Company Three"
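The part I'm stuck on is how to populate the keys dynamically while building the multidimensional array.
I found this SO thread that covers a similar issue, which was resolved by using the SUBSEP variable. At first that looked like it would do what I need, but after some testing it appears that arr["foo", "bar"] = "baz" isn't treated like a real array the way arr["foo"]["bar"] = "baz" is. One example of what I mean is that you can't count the values at any level of the array: arr["foo", "bar"] = "baz"; print length(arr["foo"]) simply prints 0 (zero).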
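To make that limitation concrete, here is a minimal sketch, assuming made-up keys ("foo", "qux", "other") that are not part of my real data. With SUBSEP-style subscripts each entry is stored under one flat string key, so the only way I can see to count the entries "under" foo is to walk every key and split it on SUBSEP:

```shell
# Hypothetical demo keys; this uses only POSIX awk features.
count=$(awk 'BEGIN {
    arr["foo", "bar"] = "baz"
    arr["foo", "qux"] = "quux"
    arr["other", "bar"] = "x"

    # Each key is really the single string "foo" SUBSEP "bar", so count
    # the entries under "foo" by splitting every key on SUBSEP.
    n = 0
    for (key in arr) {
        split(key, parts, SUBSEP)
        if (parts[1] == "foo")
            n++
    }
    print n
}')
echo "$count"
```

That works, but it scans the whole array for every count, which is why I would prefer real nested arrays.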
I found this SO thread somewhat helpful, and it may have pointed me in the right direction.
One snippet in that thread reads:
BEGIN {
    x=SUBSEP
    a="Red" x "Green" x "Blue"
    b="Yellow" x "Cyan" x "Purple"
    Colors[1][0] = ""
    Colors[2][0] = ""
    split(a, Colors[1], x)
    split(b, Colors[2], x)
    print Colors[2][3]
}
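That is very close, but the problem I have now is that the keys (e.g. Red, Green, etc.) need to be specified dynamically, and there may be one or more of them.
Basically, how can I take the a_keys and b_keys strings, split them on ., and populate the a and b variables as multidimensional arrays?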
BEGIN {
    x=SUBSEP
    # How can I take these strings...
    a_keys = "Red.Green.Blue"
    b_keys = "Yellow.Cyan.Purple"
    # .. And populate the array, just as this does:
    a="Red" x "Green" x "Blue"
    b="Yellow" x "Cyan" x "Purple"
    Colors[1][0] = ""
    Colors[2][0] = ""
    split(a, Colors[1], x)
    split(b, Colors[2], x)
    print Colors[2][3]
}
Any help would be appreciated, thanks!
All you need is:
BEGIN { FS="\t= " }
{
    split($1,d,/\./)
    data[d[1]][d[2]][d[3]] = $2
}
Look:
$ cat tst.awk
BEGIN { FS="\t= " }
{
    split($1,d,/\./)
    data[d[1]][d[2]][d[3]] = $2
}
END {
    for (x in data)
        for (y in data[x])
            for (z in data[x][y])
                print x, y, z, "->", data[x][y][z]
}
$ awk -f tst.awk file
personal other dob -> 05/07/87
personal name first -> John
personal name last -> Doe
personal contact email -> john.doe@idk
personal contact phone -> 602123456
employment jobs 1 -> Company One
employment jobs 2 -> Company Two
employment jobs 3 -> Company Three
The above is gawk-specific, of course, since no other awk supports true multidimensional arrays.
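Populating a multidimensional array when the indices don't always have the same depth (e.g. 3 in the above) is considerably more complicated: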
##########
$ cat tst.awk
function rec_populate(a,idxs,curDepth,maxDepth,tmpIdxSet) {
    if ( tmpIdxSet ) {
        delete a[SUBSEP]             # delete scalar a[]
        tmpIdxSet = 0
    }
    if (curDepth < maxDepth) {
        # We need to ensure a[][] exists before calling populate() otherwise
        # inside populate() a[] would be a scalar, but then we need to delete
        # a[][] inside populate() before trying to create a[][][] because
        # creating a[][] below creates IT as scalar. SUBSEP used arbitrarily.
        if ( !( (idxs[curDepth] in a) && (SUBSEP in a[idxs[curDepth]]) ) ) {
            a[idxs[curDepth]][SUBSEP]   # create array a[] + scalar a[][]
            tmpIdxSet = 1
        }
        rec_populate(a[idxs[curDepth]],idxs,curDepth+1,maxDepth,tmpIdxSet)
    }
    else {
        a[idxs[curDepth]] = $2
    }
}
function populate(arr,str,sep,   idxs) {
    split(str,idxs,sep)
    rec_populate(arr,idxs,1,length(idxs),0)
}
{ populate(arr,$1,",") }
END { walk_array(arr, "arr") }
function walk_array(arr, name,      i)
{
    # Mostly copied from the following URL, just added setting of "sorted_in":
    #   https://www.gnu.org/software/gawk/manual/html_node/Walking-Arrays.html
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (i in arr) {
        if (isarray(arr[i]))
            walk_array(arr[i], (name "[" i "]"))
        else
            printf("%s[%s] = %s\n", name, i, arr[i])
    }
}
##########
$ cat file
a uno
b,c dos
d,e,f tres_wan
d,e,g tres_twa
d,e,h,i,j cinco
##########
$ awk -f tst.awk file
arr[a] = uno
arr[b][c] = dos
arr[d][e][f] = tres_wan
arr[d][e][g] = tres_twa
arr[d][e][h][i][j] = cinco
Without true multidim arrays, you can do a bit more bookkeeping:
awk -F'\t= ' '{split($1,k,".");
               k1[k[1]]; k2[k[2]]; k3[k[3]];
               v[k[1],k[2],k[3]]=$2}
         END  {for(i1 in k1)
                 for(i2 in k2)
                   for(i3 in k3)
                     if((i1,i2,i3) in v)
                       print i1,i2,i3," -> ",v[i1,i2,i3]}' file
personal other dob -> 05/07/87
personal name first -> John
personal name last -> Doe
personal contact email -> john.doe@idk
personal contact phone -> 602123456
employment jobs 1 -> Company One
employment jobs 2 -> Company Two
employment jobs 3 -> Company Three
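A possible variation on the same idea, sketched here as an assumption rather than as part of the answer above: instead of keeping k1, k2 and k3 and testing every combination, loop over the keys of v itself and split each one on SUBSEP. The two input rows are a trimmed copy of the question's sample, and the output is piped through sort only because for (key in v) visits keys in an unspecified order:

```shell
# Portable awk: recover the three key parts from each SUBSEP-joined subscript.
out=$(printf 'personal.name.first\t= John\nemployment.jobs.1\t= Company One\n' |
awk -F'\t= ' '{
    split($1, k, ".")
    v[k[1], k[2], k[3]] = $2
}
END {
    for (key in v) {
        split(key, p, SUBSEP)
        print p[1], p[2], p[3], "->", v[key]
    }
}' | sort)
echo "$out"
```

This visits each stored entry exactly once, so the cost grows with the number of entries rather than with the product of the distinct keys at each level.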