Find a line in a file, parse the content between > and <, then add a line three lines before or after
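I need to edit a kml file with several thousand sections like the one below. I can wrap my head around the logic, but the actual implementation is beyond me.
Procedurally I need to:
- find the line containing Sub_Name
- parse that line to grab what is between > and <
- add that content 4 lines before the line I found (or tac the file)
- rinse, wash, repeat
I feel like I should be able to do this with a bash script and some moderately thorough sed and awk commands, but I fall down when I start to nest it all.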
<Placemark>
<name>THIS LINE NEEDS TO BE ADDED FROM THE Sub_Name LINE</name>
<Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
<ExtendedData><SchemaData schemaUrl="#gmaps">
<SimpleData name="EntID">1274433</SimpleData>
<SimpleData name="Sub_Name">HYDE PARK</SimpleData>
<SimpleData name="ORIG_FID">39</SimpleData>
<SimpleData name="Scode">S5435</SimpleData>
<SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
<SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
</SchemaData></ExtendedData>
<MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>
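This is very similar to this question, but I have been picking at it for an hour and cannot adapt it to my scenario.
Thanks for any advice and guidance.
The simple approach is to do it in 2 passes: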
$ cat tst.awk
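NR==FNR {                                         # first pass: collect the names
    if ( /Sub_Name/ ) {
        gsub(/[[:space:]]*<[^<>]+>/,"")           # strip the tags, leaving only the text
        names[NR-4] = ORS "<name>" $0 "</name>"   # attach it to the line 4 lines above
    }
    next
}
{ print $0 names[FNR] }                           # second pass: print each line plus any name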
$ awk -f tst.awk file file
<Placemark>
<name>HYDE PARK</name>
<Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
<ExtendedData><SchemaData schemaUrl="#gmaps">
<SimpleData name="EntID">1274433</SimpleData>
<SimpleData name="Sub_Name">HYDE PARK</SimpleData>
<SimpleData name="ORIG_FID">39</SimpleData>
<SimpleData name="Scode">S5435</SimpleData>
<SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
<SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
</SchemaData></ExtendedData>
<MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>
The above was generated from this input file:
$ cat file
<Placemark>
<Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
<ExtendedData><SchemaData schemaUrl="#gmaps">
<SimpleData name="EntID">1274433</SimpleData>
<SimpleData name="Sub_Name">HYDE PARK</SimpleData>
<SimpleData name="ORIG_FID">39</SimpleData>
<SimpleData name="Scode">S5435</SimpleData>
<SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
<SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
</SchemaData></ExtendedData>
<MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>
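A slightly harder approach is to keep a rolling buffer of 4 lines and always print the 4th-last line read. That is only necessary if your input comes from a pipe, or if your file is so large that you cannot afford the time to parse it twice or the memory to store all the "name" lines in an array.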
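For illustration, here is a minimal single-pass sketch of that rolling-buffer idea, written in Python rather than awk. The names `add_names`, `offset` and `SUB_NAME` are made up for this example, and it assumes each Sub_Name value sits on one line exactly 4 lines below its <Placemark> line, as in the sample input.

```python
import re
from collections import deque

# Assumed pattern: the Sub_Name value sits on a single line, as in the sample.
SUB_NAME = re.compile(r'<SimpleData name="Sub_Name">([^<]*)</SimpleData>')

def add_names(lines, offset=4):
    """Yield each input line, adding a <name> line right after the line
    that sits `offset` lines above every Sub_Name line."""
    buf = deque()  # holds at most `offset` lines that have not been emitted yet
    for line in lines:
        line = line.rstrip("\n")
        match = SUB_NAME.search(line)
        if match and len(buf) == offset:
            # buf[0] is the line `offset` lines above the current one.
            yield buf.popleft()
            yield "<name>" + match.group(1) + "</name>"
        buf.append(line)
        if len(buf) > offset:
            yield buf.popleft()
    yield from buf  # flush whatever is still buffered at end of input
```

Because it only ever holds 4 lines, it can read from a pipe: iterate over `add_names(sys.stdin)` and print each yielded line. Like the awk script, it silently skips a Sub_Name line that has fewer than 4 lines above it.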
The usual warnings about the dangers of trying to parse HTML without an HTML parser apply...
Given:
$ cat xml_file
<Placemark>
<Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
<ExtendedData><SchemaData schemaUrl="#gmaps">
<SimpleData name="EntID">1274433</SimpleData>
<SimpleData name="Sub_Name">HYDE PARK</SimpleData>
<SimpleData name="ORIG_FID">39</SimpleData>
<SimpleData name="Scode">S5435</SimpleData>
<SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
<SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
</SchemaData></ExtendedData>
<MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>
</Placemark>
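If you want to parse that XML, use xpath to find the value of a nested child node, and add another node, you might do something along these lines (Ruby, for example):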
$ ruby -r nokogiri -e 'doc=Nokogiri::XML($<.read) # {|opt| opt.strict.noblanks }
t1=doc.at_css "Placemark"
t2 = Nokogiri::XML::Node.new "name", doc
t2.parent=t1
t2.content=doc.xpath("//SimpleData[@name=\"Sub_Name\"]").text
puts doc
' xml_file
Prints:
<?xml version="1.0"?>
<Placemark>
<Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
<ExtendedData><SchemaData schemaUrl="#gmaps">
<SimpleData name="EntID">1274433</SimpleData>
<SimpleData name="Sub_Name">HYDE PARK</SimpleData>
<SimpleData name="ORIG_FID">39</SimpleData>
<SimpleData name="Scode">S5435</SimpleData>
<SimpleData name="Shape_Leng">1653.15682579000</SimpleData>
<SimpleData name="Shape_Area">13612381.56865700000</SimpleData>
</SchemaData></ExtendedData>
<MultiGeometry><Polygon><altitudeMode>clampToGround</altitudeMode><outerBoundaryIs><LinearRing><altitudeMode>clampToGround</altitudeMode><coordinates>-97.7740412096895,30.4376501989282</coordinates></LinearRing></outerBoundaryIs></Polygon></MultiGeometry>
<name>HYDE PARK</name></Placemark>
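(Note that the inserted node <name>HYDE PARK</name> is at the end of the <Placemark> node, since the schema does not specify XML order.)
Any other scripting language with an XML parser would be similar (Ruby, Python, Perl, jq, etc.)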
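As a rough Python counterpart, the sketch below uses the standard library's xml.etree.ElementTree and inserts <name> as the first child of each <Placemark>, which matches the layout asked for in the question. The function name `add_placemark_names` is made up for this example, and it assumes the elements carry no XML namespace, as in the snippet above; a full KML file usually declares one, in which case the tag names would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

def add_placemark_names(root):
    """Insert <name> as the first child of every Placemark that has a Sub_Name."""
    # list() so the tree is not modified while it is being iterated.
    for placemark in list(root.iter("Placemark")):
        sub = placemark.find(".//SimpleData[@name='Sub_Name']")
        if sub is None or placemark.find("name") is not None:
            continue  # nothing to copy, or a name is already present
        name = ET.Element("name")
        name.text = sub.text
        name.tail = placemark.text  # reuse the whitespace that followed <Placemark>
        placemark.insert(0, name)
    return root
```

To use it on a file, something like `tree = ET.parse("xml_file")`, then `add_placemark_names(tree.getroot())`, then `tree.write("out.kml")` should do, where `out.kml` is just a placeholder output name.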