Altering HTML and saving the HTML document

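I have been working on this for a while, but maybe I am just searching for the wrong things to get the answer I need.

I have a dictionary whose keys are specific words I want to find in a web page. I then want to highlight those words and save the resulting HTML to a local file.

EDIT: It occurred to me later that people like to run the code themselves. This link includes the word dictionaries and the HTML of the page I am using to test my code, as it should have the most matches of any of the pages I am scanning. Alternatively, you can use the actual website. The link would replace rl[0] in the code.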

    import http.client
    import re
    import socket
    import urllib.error
    import urllib.request

    from bs4 import BeautifulSoup

    # rl, headers, proxy_support, cj, wdict1..wdict4, c and numb are defined earlier in the script.
    try:
        # rl[0] refers to a specific URL being pulled from a list in another file.
        req = urllib.request.Request(rl[0], None, headers)
        opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPCookieProcessor(cj))
        resp = opener.open(req)
        soup = BeautifulSoup(resp.read(), 'html.parser')
        resp.close()
    except urllib.error.HTTPError:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("HTTP error when opening " + rl[0])
    except urllib.error.URLError:
        print("URL error when opening " + rl[0])
    except http.client.HTTPException as err:
        print(err, "HTTP exception error when opening " + rl[0])
    except socket.timeout:
        print("connection timed out accessing " + rl[0])
        soup = None
    else:
        for l in [wdict1, wdict2, wdict3, wdict4]:
            for i in l:
                foundvocab = soup.find_all(text=re.compile(i))
                for term in foundvocab:
                    # c indicates the highlight color determined earlier in the script based on which dictionary the word came from.
                    # numb is a term I defined earlier to use as a reference to another document this script creates.
                    fixed = term.replace(i, '<mark background-color="' + c + '">' + i + '<sup>' + numb + '</sup></mark>')
                    term.replace_with(fixed)
        # print(..., file=...) needs an open file object, not a path.
        with open("path/local.html", "w", encoding="utf-8") as out:
            print(soup, file=out)
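
Writing the document out needs an open file object: `print(..., file=...)` does not accept a bare path, and an unquoted `path/local.html` is parsed as an expression rather than a filename. The following is a minimal sketch of the save step only. `html_text` stands in for `str(soup)`, and the temporary directory is a hypothetical location chosen so the example can run anywhere.

```python
import os
import tempfile

# Stand-in for str(soup); in the real script this would be the parsed page.
html_text = "<html><body><p>hello</p></body></html>"

# Hypothetical output location, used only for illustration.
out_path = os.path.join(tempfile.mkdtemp(), "local.html")

# Open the file first, then hand the file object to print().
with open(out_path, "w", encoding="utf-8") as out:
    print(html_text, file=out)
```

Calling `out.write(html_text)` inside the `with` block would work equally well; `print` simply adds a trailing newline.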

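The problem I am having is that when soup is printed, it prints the whole paragraph for each word it finds, but nothing is highlighted. Alternatively, if I say: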

    foundvocab = soup.find_all(text=i)

the resulting HTML file is blank.
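
One difference between the two forms that may be relevant here: a plain string passed to `find_all` has to equal the whole text node, while a compiled regular expression only has to match somewhere inside it. The sketch below uses a made-up two-paragraph document to show this; it does not by itself account for an empty output file. It uses `string=`, the current name for the `text=` argument.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample document, for illustration only.
html = "<p>A red apple.</p><p>apple</p>"
soup = BeautifulSoup(html, "html.parser")

# A plain string matches only text nodes that are exactly "apple".
exact = soup.find_all(string="apple")

# A regular expression matches any text node that contains "apple".
partial = soup.find_all(string=re.compile("apple"))

print(len(exact), len(partial))
```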

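OK. The code below finds and replaces what I need. Now I just have display issues with the greater-than and less-than symbols, which is a different question. Thanks to those who took the time to have a look.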

    # Find every text node containing the word i, then swap in the highlighted markup.
    foundvocab = soup.find_all(text=re.compile(i), recursive=True)
    for fw in foundvocab:
        fixed_text = fw.replace(i, '<mark background-color="' + c + '">' + i + '<sup>' + numb + '</sup></mark>')
        fw.replace_with(fixed_text)
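
The greater-than and less-than display issue most likely comes from `replace_with` receiving a plain string: BeautifulSoup treats it as text and escapes `<` and `>` on output. One possible way around that, sketched below under assumptions, is to build real tags with `soup.new_tag` and splice them in around the matched words. The helper name `highlight` and its parameters are hypothetical, and the sketch uses a `style` attribute for the color, since `background-color` on its own is not an HTML attribute.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def highlight(soup, word, color, ref):
    """Wrap each occurrence of `word` in <mark>, followed by a <sup> reference.

    Hypothetical helper: `color` plays the role of c and `ref` the role of numb.
    """
    pattern = re.compile(re.escape(word))
    for node in list(soup.find_all(string=pattern)):
        # Skip text that should not be touched or was already highlighted.
        if node.parent.name in ("script", "style", "mark"):
            continue
        text = str(node)
        pieces = []
        pos = 0
        for match in pattern.finditer(text):
            if match.start() > pos:
                pieces.append(NavigableString(text[pos:match.start()]))
            mark = soup.new_tag("mark", style="background-color: " + color)
            mark.string = match.group(0)
            sup = soup.new_tag("sup")
            sup.string = ref
            mark.append(sup)
            pieces.append(mark)
            pos = match.end()
        if pos < len(text):
            pieces.append(NavigableString(text[pos:]))
        # Replace the original text node with the mix of strings and tags.
        for piece in pieces:
            node.insert_before(piece)
        node.extract()


soup = BeautifulSoup("<p>red apple and green apple</p>", "html.parser")
highlight(soup, "apple", "yellow", "1")
print(soup)
```

Because the `<mark>` and `<sup>` elements are real tags in the tree, `str(soup)` serializes them as markup instead of as escaped text.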