Extracting multi line text from a file between delimiters in bash

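I am trying to extract multi-line text from a text file whose values are separated by a delimiter, and save it to a string or an array. Most of the values are extracted with awk and saved to variables, but the problem appears when I need to extract a product's multi-line description into a variable or array.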

The simplified input file syntax looks like this: ID;Name;value1;value2;DESCRIPTION;valueX;valueY;

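I am using awk -F ";" '{print $1}' to extract the first values and assign them to variables for later use. That works fine, but the problem appears in the "DESCRIPTION" part, because it spans multiple lines and contains HTML tags. An example of DESCRIPTION: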

value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>


<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY

Could you suggest a way to do this, so that I can assign DESCRIPTION to some kind of variable or array in my bash script and process it further?

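You (originally) asked for an awk-based solution. As others mentioned in the comments, there are better tools for this job. That said, based on sections 4.9 Multiple-Line Records and 4.7 Defining Fields by Content of the gawk manual, you could try something like: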

$ awk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
[...]
$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
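  1. RS = ";\n" assumes that your input file has multiple ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records, and that each record is terminated by a ; (the ; after valueY in your example) followed by a newline.
  2. FPAT = "([^;]+)|(\"<p.+p>\")" is a "best effort" way of telling (g)awk what the fields of your record look like. You may need to adapt it to your data. It says there are two field formats (see (...)|(...)). The first format captures strings that do not contain ; and is used for every field except DESCRIPTION. The second format captures strings that start with "<p and end with p>".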

Run against a file containing two ID;Name;value1;value2;DESCRIPTION;valueX;valueY; records:
$ cat testfile 
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>


<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;
ID;Name;value1;value2;"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
  

<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>";valueX;valueY;

$ awk 'BEGIN {RS = ";\n"; FPAT = "([^;]+)|(\"<p.+p>\")" } { print "NF = ", NF; for (i = 1; i <= NF; i++) { printf("$%d = %s\n", i, $i) } }' testfile
NF =  7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>


<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY
NF =  7
$1 = ID
$2 = Name
$3 = value1
$4 = value2
$5 = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
  

<strong>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</strong>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. </p>

<p style=""text-align: center;"">
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>"
$6 = valueX
$7 = valueY