How to print JSON objects in AWK
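I was looking for a built-in function in awk to easily generate JSON objects. I came across several answers and decided to write my own.
I want to generate JSON from a multidimensional array that stores table-style data, using a separate, dynamically defined schema that describes the JSON to generate.
Desired output: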
{
"Name": JanA
"Surname": NowakA
"ID": 1234A
"Role": PrezesA
}
{
"Name": JanD
"Surname": NowakD
"ID": 12341D
"Role": PrezesD
}
{
"Name": JanC
"Surname": NowakC
"ID": 12342C
"Role": PrezesC
}
Input file:
pierwsza linia
druga linia
trzecia linia
dane wspólników
imie JanA
nazwisko NowakA
pesel 11111111111A
funkcja PrezesA
imie Ja"nD
nazwisko NowakD
pesel 11111111111
funkcja PrezesD
imie JanC
nazwisko NowakC
pesel 12342C
funkcja PrezesC
czwarta linia
reprezentanci
imie Tomek
Based on the input file I built a multidimensional array:
JanA NowakA 1234A PrezesA
JanD NowakD 12341D PrezesD
JanC NowakC 12342C PrezesC
I updated my awk implementation of a simple array printer to add regex-based validation for each column (run with gawk):
function ltrim(s) { sub(/^[ \t]+/, "", s); return s }
function rtrim(s) { sub(/[ \t]+$/, "", s); return s }
function sTrim(s) {
    return rtrim(ltrim(s))
}

# Escape the characters that must not appear raw inside a JSON string.
# The backslash is handled first so later escapes are not doubled.
function jsonEscape(jsValue) {
    gsub(/\\/, "\\\\", jsValue)
    gsub(/"/, "\\\"", jsValue)
    gsub(/\b/, "\\b", jsValue)
    gsub(/\f/, "\\f", jsValue)
    gsub(/\n/, "\\n", jsValue)
    gsub(/\r/, "\\r", jsValue)
    gsub(/\t/, "\\t", jsValue)
    return jsValue
}

# Escape a value and wrap it in double quotes.
function jsonStringEscapeAndWrap(jsValue) {
    return "\"" jsonEscape(jsValue) "\""
}

# Print contentArray (rows x columns) as JSON objects; keys come from schemaArray.
# Values are emitted as-is, so strings must already be quoted.
function jsonPrint(contentArray, contentRowsCount, schemaArray) {
    result = ""
    schemaLength = length(schemaArray)
    for (x = 1; x <= contentRowsCount; x++) {
        result = result "{"
        for (y = 1; y <= schemaLength; y++) {
            result = result "\"" sTrim(schemaArray[y]) "\":" sTrim(contentArray[x, y])
            if (y < schemaLength) {
                result = result ","
            }
        }
        result = result "}"
        if (x < contentRowsCount) {
            result = result ",\n"
        }
    }
    return result
}

# Like jsonPrint, but schemaArray[y, 1] is the key and schemaArray[y, 2] a regex.
# A value that fails its regex is replaced by null and recorded in errorArray.
function jsonValidateAndPrint(contentArray, contentRowsCount, schemaArray, schemaColumnsCount, errorArray) {
    result = ""
    errorsCount = 1
    for (x = 1; x <= contentRowsCount; x++) {
        jsonRow = "{"
        for (y = 1; y <= schemaColumnsCount; y++) {
            regexValue = schemaArray[y, 2]
            jsonValue = sTrim(contentArray[x, y])
            isValid = jsonValue ~ regexValue
            if (isValid == 0) {
                errorArray[errorsCount, 1] = "\"" sTrim(schemaArray[y, 1]) "\""
                errorArray[errorsCount, 2] = "\"Value " jsonValue " not match format: " regexValue " \""
                errorArray[errorsCount, 3] = x
                errorsCount++
                jsonValue = "null"
            }
            jsonRow = jsonRow "\"" sTrim(schemaArray[y, 1]) "\":" jsonValue
            if (y < schemaColumnsCount) {
                jsonRow = jsonRow ","
            }
        }
        jsonRow = jsonRow "}"
        result = result jsonRow
        if (x < contentRowsCount) {
            result = result ",\n"
        }
    }
    return result
}

BEGIN {
    rowsCount = 1
    matchCount = 0
    errorsCount = 0

    # Column names
    shareholdersJsonSchema[1, 1] = "Imie"
    shareholdersJsonSchema[2, 1] = "Nazwisko"
    shareholdersJsonSchema[3, 1] = "PESEL"
    shareholdersJsonSchema[4, 1] = "Funkcja"

    # Validation regex per column (".*" accepts anything)
    shareholdersJsonSchema[1, 2] = ".*"
    shareholdersJsonSchema[2, 2] = ".*"
    shareholdersJsonSchema[3, 2] = "^[0-9]{11}$"
    shareholdersJsonSchema[4, 2] = ".*"

    errorsSchema[1] = "PropertyName"
    errorsSchema[2] = "Message"
    errorsSchema[3] = "PositionIndex"

    resultSchema[1] = "ShareHolders"
    resultSchema[2] = "Errors"
}

# Only look at the lines between the two section markers.
/dane wspólników/,/czwarta linia/ {
    if (/imie/ || /nazwisko/ || /pesel/ || /funkcja/) {
        if (/imie/) {
            shareholdersArray[rowsCount, 1] = jsonStringEscapeAndWrap($2)
            matchCount++
        }
        if (/nazwisko/) {
            shareholdersArray[rowsCount, 2] = jsonStringEscapeAndWrap($2)
            matchCount++
        }
        if (/pesel/) {
            # left unquoted: it is validated and emitted as a number
            shareholdersArray[rowsCount, 3] = $2
            matchCount++
        }
        if (/funkcja/) {
            shareholdersArray[rowsCount, 4] = jsonStringEscapeAndWrap($2)
            matchCount++
        }
        # four fields collected: the row is complete
        if (matchCount == 4) {
            rowsCount++
            matchCount = 0
        }
    }
}

END {
    shareHolders = jsonValidateAndPrint(shareholdersArray, rowsCount - 1, shareholdersJsonSchema, 4, errorArray)
    shareHoldersErrors = jsonPrint(errorArray, length(errorArray) / length(errorsSchema), errorsSchema)
    resultArray[1,1] = "\n[\n" shareHolders "\n]\n"
    resultArray[1,2] = "\n[\n" shareHoldersErrors "\n]\n"
    resultJson = jsonPrint(resultArray, 1, resultSchema)
    print resultJson
}
This produces the output:
{"ShareHolders":
[
{"Imie":"JanA","Nazwisko":"NowakA","PESEL":null,"Funkcja":"PrezesA"},
{"Imie":"Ja\"nD","Nazwisko":"NowakD","PESEL":11111111111,"Funkcja":"PrezesD"},
{"Imie":"JanC","Nazwisko":"NowakC","PESEL":null,"Funkcja":"PrezesC"}
]
,"Errors":
[
{"PropertyName":"PESEL","Message":"Value 11111111111A not match format: ^[0-9]{11}$ ","PositionIndex":1},
{"PropertyName":"PESEL","Message":"Value 12342C not match format: ^[0-9]{11}$ ","PositionIndex":3}
]
}
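To sanity-check the escaping step on its own, here is a minimal sketch that runs just that part from the shell. `json_quote` and `json_escape` are hypothetical names for this illustration, not part of the script above, and only backslash, double quote, tab and carriage return are handled. It doubles backslashes with `&&` (the matched text, twice), which as far as I know avoids depending on how a particular awk treats `"\\\\"` in a replacement string.

```sh
# json_quote: hypothetical helper for illustration only.
# Reads lines on stdin and prints each one as a JSON string literal.
json_quote() {
    awk '
    function json_escape(s) {
        gsub(/\\/, "&&", s)     # backslash first: "&" is the matched text, so "&&" doubles it
        gsub(/"/, "\\\"", s)    # double quote becomes backslash + double quote
        gsub(/\t/, "\\t", s)    # tab becomes the two characters backslash + t
        gsub(/\r/, "\\r", s)    # carriage return becomes backslash + r
        return s
    }
    { print "\"" json_escape($0) "\"" }
    '
}
```

For example, `printf 'Ja"nD\n' | json_quote` prints `"Ja\"nD"`, which is the form the `Imie` value takes in the output above.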
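I'll take a stab at a gawk solution. The indentation isn't perfect and the results aren't sorted (see the "Sorting" note below), but it can at least walk a true multidimensional array recursively, and it should produce valid, parsable JSON from any array. Bonus: the data array is the schema. Array keys become JSON keys, so there's no need to create a separate schema array in addition to the data array.
Just be sure to use the true multidimensional array[d1][d2][d3]... convention when constructing your data array, rather than the concatenated index array[d1,d2,d3...] convention.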
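The two conventions differ because, with the comma form, awk joins the subscripts with SUBSEP into a single string key, so there are no nested arrays for a recursive function to descend into. A small sketch of that, which should run in any POSIX awk (`flat_keys` and the sample value are made up for the illustration):

```sh
# flat_keys: hypothetical demo of the concatenated-index convention.
flat_keys() {
    awk 'BEGIN {
        data[1, "name"] = "JanA"
        # There is only ONE array level here: the key is "1" SUBSEP "name".
        for (k in data) {
            n = split(k, part, SUBSEP)
            print n ": " part[1] " / " part[2] " = " data[k]
        }
    }'
}
```

Calling `flat_keys` prints `2: 1 / name = JanA`: one flat key that has to be split apart by hand. With gawk's `data[1]["name"]` form, `data[1]` is itself an array, which is what the `isarray()` check in `serialize()` relies on.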
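Update:
I've posted an updated JSON gawk script as a GitHub Gist. Although the script below is tested as working with OP's data, I might have made improvements since this post was last edited. Please see the Gist for the most thoroughly tested, bug-squashed version.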
#!/usr/bin/gawk -f

BEGIN { IGNORECASE = 1 }

# $1 is the field label, $2 the value; "imie" starts a new record
$1 ~ "imie"     { record[++idx]["name"] = $2 }
$1 ~ "nazwisko" { record[idx]["surname"] = $2 }
$1 ~ "pesel"    { record[idx]["ID"] = $2 }
$1 ~ "funkcja"  { record[idx]["role"] = $2 }

END { print serialize(record, "\t") }

# ==== FUNCTIONS ====

function join(arr, sep,    _p, i) {
    # syntax: join(array, string separator)
    # returns a string
    for (i in arr) {
        _p["result"] = _p["result"] ~ "[[:print:]]" ? _p["result"] sep arr[i] : arr[i]
    }
    return _p["result"]
}

function quote(str) {
    gsub(/\\/, "\\\\", str)
    gsub(/"/, "\\\"", str)    # added: escape embedded double quotes too
    gsub(/\r/, "\\r", str)
    gsub(/\n/, "\\n", str)
    gsub(/\t/, "\\t", str)
    return "\"" str "\""
}

function serialize(arr, indent_with, depth,    _p, i, idx) {
    # syntax: serialize(array of arrays, indent string)
    # returns a JSON formatted string

    # sort arrays on key, ensures [...] values remain properly ordered
    if (!PROCINFO["sorted_in"]) PROCINFO["sorted_in"] = "@ind_num_asc"

    # determine whether array is indexed or associative
    for (i in arr) {
        _p["assoc"] = or(_p["assoc"], !(++_p["idx"] in arr))
    }

    # if associative, indent
    if (_p["assoc"]) {
        for (i = ++depth; i--;) {
            _p["end"] = _p["indent"]; _p["indent"] = _p["indent"] indent_with
        }
    }

    for (i in arr) {
        # If key length is 0, assume it's an empty object
        if (!length(i)) return "{}"

        # quote key if not already quoted
        _p["key"] = i !~ /^".*"$/ ? quote(i) : i

        if (isarray(arr[i])) {
            if (_p["assoc"]) {
                _p["json"][++idx] = _p["indent"] _p["key"] ": " \
                    serialize(arr[i], indent_with, depth)
            } else {
                # if indexed array, don't print keys
                _p["json"][++idx] = serialize(arr[i], indent_with, depth)
            }
        } else {
            # quote if not numeric, boolean, null, already quoted, or too big for match()
            # (alternatives are grouped so that ^ and $ apply to each of them)
            if (!((arr[i] ~ /^[0-9]+([\.e][0-9]+)?$/ && arr[i] !~ /^0[0-9]/) ||
                arr[i] ~ /^(true|false|null|".*")$/) || length(arr[i]) > 1000)
                arr[i] = quote(arr[i])

            _p["json"][++idx] = _p["assoc"] ? _p["indent"] _p["key"] ": " arr[i] : arr[i]
        }
    }

    # I trial and errored the hell out of this. Problem is, gawk can't distinguish between
    # a value of null and no value. I think this hack is as close as I can get, although
    # [""] will become [].
    if (!_p["assoc"] && join(_p["json"]) == "\"\"") return "[]"

    # surround with curly braces if object, square brackets if array
    return _p["assoc"] ? "{\n" join(_p["json"], ",\n") "\n" _p["end"] "}" \
        : "[" join(_p["json"], ", ") "]"
}
Output with the OP's example data:
[{
"ID": "1234A",
"name": "JanA",
"role": "PrezesA",
"surname": "NowakA"
}, {
"ID": "12341D",
"name": "JanD",
"role": "PrezesD",
"surname": "NowakD"
}, {
"ID": "12342C",
"name": "JanC",
"role": "PrezesC",
"surname": "NowakC"
}, {
"name": "Tomek"
}]
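Sorting
Although by default the results are ordered in a way only gawk understands, gawk can sort the results on a field. For example, if you want to sort on the ID field, add this function: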
function cmp_ID(i1, v1, i2, v2) {
    if (!isarray(v1) && v1 ~ /"ID"/) {
        return v1 < v2 ? -1 : (v1 != v2)
    }
}
Then insert this line in the END section, above print serialize(record):
PROCINFO["sorted_in"] = "cmp_ID"
For details, see "Controlling Array Traversal" in the gawk manual.