Extracting multi-line text from a file between delimiters in bash
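I am trying to extract multi-line text from a text file in which the values are separated by a delimiter, and save it to a string or an array. Most values are extracted with awk and saved to variables, but the problem appears when I need to extract a product's multi-line description into a variable or array.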
The simplified syntax of the input file looks like this:
ID;Name;value1;value2;DESCRIPTION;valueX;valueY;
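I'm using awk -F ";" '{print $1}' to extract the first values and assign them to variables for later use. That works fine, but the problem appears with the "DESCRIPTION" part, because it spans multiple lines and contains HTML tags. Example of a DESCRIPTION: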
value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY
Could you suggest a way to do this, so that I can assign DESCRIPTION to some kind of variable or array in a bash script and manipulate it further?
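You (originally) asked for an awk-based solution. As others mentioned in the comments, there are better tools for this job. That said, based on the sections 4.9 Multiple-Line Records and 4.7 Defining Fields by Content of the GNU Awk manual, you can try something like: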
$ awk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
[...]
$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
RS = ";\n" assumes that your input file contains multiple ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records, and that these records are separated by a ; (the one after valueY in your example) followed by a newline.
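To see the effect of the record separator in isolation, here is a minimal sketch. The sample data, the `sample` variable and the `count_records` helper are made up for illustration. It relies on awk treating a multi-character RS as a regular expression, which gawk and mawk do; POSIX leaves that case unspecified.

```shell
# Hypothetical two-record sample; each record ends with ";" followed by a newline.
sample=$(mktemp)
printf '%s\n' \
  '1;Widget;"<p>first line' \
  'second line</p>";x;y;' \
  '2;Gadget;"<p>only line</p>";x;y;' > "$sample"

# With RS = ";\n", a bare newline inside DESCRIPTION no longer ends the record.
count_records() {
  awk 'BEGIN { RS = ";\n" } END { print NR }' "$1"
}

count_records "$sample"   # prints 2, although the file has 3 physical lines
```

Note that this only holds as long as no line inside a description happens to end with a ; itself.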
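FPAT = "([^;]+)|(\"<p.+p>\")" is a "best effort" way of telling (g)awk what the fields of your record look like; you may need to adapt it to your data. It effectively says that there are two field formats (see (...)|(...)). The first format matches strings that do not contain ; and captures every field except DESCRIPTION. The second format matches a string that starts with "<p and ends with p>".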
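The following sketch applies the same FPAT to a single one-line record, so the field splitting can be checked without the multi-line part. The `record` value and the `split_fields` helper are hypothetical; FPAT is a GNU awk feature, so the sketch assumes GNU awk is installed as `gawk`.

```shell
# Hypothetical record with a ';' inside the quoted description.
record='7;Lamp;"<p style=""color: red;"">Bright</p>";valueX;valueY'

# Print every field that FPAT recognises, one per line, with its index.
split_fields() {
  printf '%s\n' "$1" | gawk '
    BEGIN { FPAT = "([^;]+)|(\"<p.+p>\")" }
    { for (i = 1; i <= NF; i++) printf("$%d = %s\n", i, $i) }'
}

split_fields "$record"
```

At the opening quote both alternatives can match, and gawk takes the longest match, so the whole quoted description (including the ; after red) ends up in $3 instead of being cut at the first ;.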
Run against a file with 2 ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records:
$ cat testfile
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
NF =  7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY
NF =  7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>
<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY
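To get from this output to a shell variable, one option is command substitution around the same awk program. This is only a sketch: the sample file, the `get_description` helper and its `want` parameter are hypothetical names, and it again assumes GNU awk is available as `gawk`.

```shell
# Hypothetical sample file in the same layout as testfile above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1;Widget;v1;v2;"<p>first line</p>
<p style=""text-align: center;"">second line</p>";x;y;
2;Gadget;v1;v2;"<p>only line</p>";x;y;
EOF

# Print field 5 (DESCRIPTION) of record number $2 from file $1.
get_description() {
  gawk -v want="$2" '
    BEGIN { RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" }
    NR == want { printf "%s", $5; exit }' "$1"
}

# Command substitution keeps embedded newlines; only trailing ones are stripped.
description=$(get_description "$sample" 1)
printf '%s\n' "$description"
```

The variable still contains the surrounding quotes and the doubled "" from the file, so any unquoting has to happen in a later step.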