Remove HTML markup

Question

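I am automating the marking process for a Python class. However, when I download the submissions from the online system, some of them contain HTML markup that students may have inadvertently submitted along with their solutions, for example: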

<!DOCTYPE html><html><head><meta charset="UTF-8"></head><body><p><span style="font-family:'courier new', courier, monospace;">print("Bob and Bill Tiling Solutions Inc.")</span></p>
<p><span style="font-family:'courier new', courier, monospace;">h=int(input("Height   (m):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">w=int(input("Width    (m):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">p=int(input("Cost ($/m^2):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">print("The total cost for this job: $" + str(h*w*p+20))</span></p>
<p> </p></body></html>

Is there a way to remove the markup in batch, so that all that remains is:

print("Bob and Bill Tiling Solutions Inc.")
h=int(input("Height   (m):"))
w=int(input("Width    (m):"))
p=int(input("Cost ($/m^2):"))
print("The total cost for this job: $" + str(h*w*p+20))

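If there is a third-party utility that can do this, I am happy to download it.

I have already tried using a regular expression with findstr, but to no avail (my search string is "<[^>]*>", but I don't know how to use findstr to delete every match from the text files).

Any suggestions are welcome.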

Answer 1

Here is a sed script (I use GNU sed), adapted from Eric Pement's sed one-liners:

The sed command line:

sed -f dehtml.sed yourfilename

The file dehtml.sed:

:a
s/<[^>]*>//g;/</N;//ba
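
For reference, the script deletes every `<...>` tag on the current line; if a `<` is still left over (a tag that continues on the next line), `N` appends the next line and `//ba` jumps back to label `a` to try again. Since this is for a Python class anyway, here is a rough Python sketch of the same idea, in case sed is not available on the marking machine. It is not part of the sed answer: the names `strip_markup` and `clean_directory` are made up for illustration, and it assumes the submissions are UTF-8 `.py` files and that any `<` or `>` in the student's code was exported as `&lt;` / `&gt;` (so tags are stripped first and entities decoded afterwards).

```python
import html
import re
from pathlib import Path

# Same pattern as in the question. In Python, [^>] also matches newlines,
# so a tag split across two lines is removed without an explicit loop.
TAG_RE = re.compile(r"<[^>]*>")


def strip_markup(text: str) -> str:
    """Remove HTML tags, decode entities, and drop lines left empty."""
    without_tags = TAG_RE.sub("", text)
    # Decode &lt;, &gt;, &amp;, &nbsp; ... only after the tags are gone.
    decoded = html.unescape(without_tags)
    # Assumption: non-breaking spaces should become ordinary spaces,
    # otherwise they would be a syntax error in the student's code.
    decoded = decoded.replace("\xa0", " ")
    lines = (line.rstrip() for line in decoded.splitlines())
    return "\n".join(line for line in lines if line.strip()) + "\n"


def clean_directory(src: Path, dst: Path) -> None:
    """Write a cleaned copy of every .py file in src into dst."""
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.py")):
        cleaned = strip_markup(path.read_text(encoding="utf-8"))
        (dst / path.name).write_text(cleaned, encoding="utf-8")
```

Calling `clean_directory(Path("submissions"), Path("cleaned"))` (both folder names are placeholders) would leave the originals untouched and write the stripped files to a separate folder. Like the sed script, this is a plain regex strip and not a real HTML parser, so a submission that contains a literal, unescaped `<` in its code could be damaged.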

markup

automation

batch-file