Extracting string from HTML text

Question

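I am using curl to fetch HTML and need to extract only the second table element. Note that the HTML returned by curl is a single, unformatted string. To explain better, see the following (... stands for more HTML):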

...
<table width="100%" cellpadding="0" cellspacing="0" class="table">
...
</table>
...
#I need to extract the following table
#from here
<table width="100%" cellpadding="4">
...
</table> #to this
...

So far I have tried several sed one-liners, and I don't think matching the second table like this is a clean way to do it:

sed -n '/<table width="100%" cellpadding="4"/,/table>/p'

Answer 1

Save the script below as script.py and run it as follows:

python3 script.py input.html

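This script parses the HTML and checks the attributes (width and cellpadding). The advantage of this approach is that it keeps working if the formatting of the HTML file changes, because the script parses the HTML instead of relying on exact string matching.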

from html.parser import HTMLParser
import sys

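def print_tag(tag, attrs, end=False):
    # Rebuild a start tag (or an end tag when end=True) from the parsed pieces.
    line = "<"
    if end:
        line += "/"
    line += tag
    for attr, value in attrs:
        if value is None:
            # Valueless attributes (e.g. "nowrap") are reported with value None.
            line += " " + attr
        else:
            line += " " + attr + '="' + value + '"'
    print(line + ">", end="")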

if len(sys.argv) < 2:
    print("ERROR: expected argument - filename")
    sys.exit(1)

with open(sys.argv[1], 'r', encoding='cp1252') as content_file:
    content = content_file.read()

do_print = False

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        global do_print
        if tag == "table":
            if ("width", "100%") in attrs and ("cellpadding", "4") in attrs:
                do_print = True
        if do_print:
            print_tag(tag, attrs)

    def handle_endtag(self, tag):
        global do_print
        if do_print:
            print_tag(tag, attrs=(), end=True)
            if tag == "table":
                do_print = False

    def handle_data(self, data):
        global do_print
        if do_print:
            print(data, end="")

parser = MyHTMLParser()
parser.feed(content)
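
To illustrate why the attribute check survives formatting changes, here is a minimal sketch. The names AttrCollector and table_attrs are hypothetical helpers introduced only for this illustration. It shows that html.parser hands attributes to handle_starttag as a list of (name, value) tuples, with names lowercased and quotes removed, so the membership test does not depend on attribute order or quote style.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attrs list of every <table> start tag."""

    def __init__(self):
        super().__init__()
        self.tables = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; names arrive lowercased.
        if tag == "table":
            self.tables.append(attrs)


def table_attrs(html):
    """Return the attribute list of each <table> start tag found in html."""
    collector = AttrCollector()
    collector.feed(html)
    collector.close()
    return collector.tables


# Attribute order, case, and quote style differ from the markup in the question.
sample = "<table cellpadding='4' WIDTH=\"100%\"><tr><td>x</td></tr></table>"
attrs = table_attrs(sample)[0]
print(attrs)
print(("width", "100%") in attrs and ("cellpadding", "4") in attrs)  # True
```

An exact string match on the start tag would miss this variant, while the tuple check still finds it.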

Answer 2

An HTML parser would be better, but you can use awk like this:

awk '/<table width="100%" cellpadding="4">/ {f=1} f; /<\/table>/ {f=0}' file
<table width="100%" cellpadding="4">
...
</table> #to this

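/<table width="100%" cellpadding="4">/ {f=1} sets the flag f to true when the start is found.
f; if the flag f is true, performs the default action, which is to print the line.
/<\/table>/ {f=0} clears the flag f when the end is found, which stops printing.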
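
For readers who don't use awk, the same flag logic can be sketched in Python. The function name extract_between is made up for this illustration, and it uses plain substring tests where awk uses regular expressions. Like the awk command, it assumes the tags sit on separate lines.

```python
def extract_between(lines, start, end):
    """Yield lines from one containing `start` through the next containing `end`."""
    printing = False          # plays the role of awk's flag f
    for line in lines:
        if start in line:     # /start/ {f=1}
            printing = True
        if printing:          # f;  (print while the flag is set)
            yield line
        if end in line:       # /end/ {f=0}
            printing = False


sample = [
    '<table width="100%" cellpadding="0" cellspacing="0" class="table">',
    "first table body",
    "</table>",
    "text in between",
    '<table width="100%" cellpadding="4">',
    "second table body",
    "</table>",
    "trailing text",
]

for line in extract_between(sample, '<table width="100%" cellpadding="4">', "</table>"):
    print(line)
```

The order of the three checks matters: testing the start before printing and the end after printing is what makes both boundary lines appear in the output.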

This also works, but I prefer the flag approach:

awk '/<table width="100%" cellpadding="4">/,/<\/table>/' file
<table width="100%" cellpadding="4">
...
</table> #to this
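
Both awk commands work line by line, while the question says the HTML arrives as one unformatted string, in which case every tag is on the same line. As a rough sketch for that case, assuming the start tag text is known exactly and the wanted table contains no nested table, a non-greedy regular expression can cut the element out of the string. The function name extract_table is hypothetical.

```python
import re


def extract_table(html, start_tag):
    """Return the text from start_tag through the first following </table>, or None."""
    # re.escape keeps characters such as '%' and '"' literal; DOTALL lets '.'
    # cross newlines in case the input is not a single line after all.
    pattern = re.escape(start_tag) + r".*?</table>"
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(0) if match else None


html = (
    '...<table width="100%" cellpadding="0" cellspacing="0" class="table">a</table>'
    '...<table width="100%" cellpadding="4"><tr><td>b</td></tr></table>...'
)
print(extract_table(html, '<table width="100%" cellpadding="4">'))
```

With a nested table the match would stop at the inner closing tag, which is one more reason an HTML parser is the more robust choice.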

Tags: html, regex, sed, extraction, html-parsing