Altering HTML and saving the HTML document
I have been working on this for a while, but maybe I am just searching for the wrong thing to get the answer I need.
I have a dictionary whose keys are specific words I want to find in a web page. I then want to highlight those words and save the resulting HTML to a local file.
Edit: It occurred to me afterwards that people like to run the code themselves. This link includes the word dictionaries and the HTML of the page I am using to test my code, as it should have the most matches of any of the pages I am scanning. Alternatively, you can use the actual website; the link replaces rl[0] in the code.
try:
    # rl[0] refers to a specific URL pulled from a list in another file.
    req = urllib.request.Request(rl[0], None, headers)
    opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPCookieProcessor(cj))
    resp = opener.open(req)
    soup = BeautifulSoup(resp.read(), 'html.parser')
    resp.close()
except urllib.error.HTTPError:
    # HTTPError is a subclass of URLError, so it has to be caught first.
    print("HTTP error when opening " + rl[0])
except urllib.error.URLError:
    print("URL error when opening " + rl[0])
except http.client.HTTPException as err:
    print(err, "HTTP exception error when opening " + rl[0])
except socket.timeout:
    print("connection timed out accessing " + rl[0])
    soup = None
else:
    for l in [wdict1, wdict2, wdict3, wdict4]:
        for i in l:
            foundvocab = soup.find_all(text=re.compile(i))
            for term in foundvocab:
                # c is the highlight color chosen earlier based on which dictionary the word came from.
                # numb is a reference number defined earlier, pointing to another document this script creates.
                fixed = term.replace(i, '<mark background-color="' + c + '">' + i + '<sup>' + numb + '</sup></mark>')
                term.replace_with(fixed)
    with open('path/local.html', 'w') as f:
        print(soup, file=f)
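One detail on the final step worth isolating: `print(soup, file=...)` needs an open file object, not a bare path expression. A minimal sketch of writing a soup out to disk (the filename `local.html` here is just a placeholder):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>hello</p>', 'html.parser')

# file= expects a file object opened for writing, not a path expression.
with open('local.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))
```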
The problem I am having is that when soup prints, it prints the whole paragraph for every word it finds, but with no highlighting. Alternatively, I can say:
foundvocab = soup.find_all(text=i)
and then the resulting HTML file is blank.
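If it helps to see the "no highlighting" symptom in isolation: when a NavigableString is replaced with a plain Python string that contains markup, BeautifulSoup stores it as literal text and escapes the angle brackets on output. A standalone reproduction (not the question's actual page):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>find the fox here</p>', 'html.parser')
node = soup.find(string='find the fox here')

# The replacement is an ordinary str, so bs4 treats it as text,
# and the < and > are escaped when the document is rendered.
node.replace_with(node.replace('fox', '<mark>fox</mark>'))
print(soup)  # → <p>find the &lt;mark&gt;fox&lt;/mark&gt; here</p>
```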
OK. The code below finds and replaces what I need. Now I just have a display problem with the greater-than and less-than signs, which is a different issue. Thanks to those who took the time to look.
foundvocab = soup.find_all(text=re.compile('{0}'.format(i)), recursive=True)
for fw in foundvocab:
    fixed_text = fw.replace('{0}'.format(i), '<mark background-color="' + c + '">' + i + '<sup>' + numb + '</sup></mark>')
    fw.replace_with(fixed_text)
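On the remaining greater-than/less-than display issue: a plain string passed to replace_with() gets escaped, but the replacement can be parsed into a real fragment first. A sketch of that approach, with placeholder values standing in for i, c, and numb, and style= used since background-color is a CSS property rather than an HTML attribute:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>The quick brown fox jumps.</p>', 'html.parser')

word = 'fox'      # stands in for i
color = 'yellow'  # stands in for c
numb = '1'        # stands in for numb

for node in soup.find_all(string=re.compile(re.escape(word))):
    marked = node.replace(
        word,
        '<mark style="background-color:' + color + '">'
        + word + '<sup>' + numb + '</sup></mark>')
    # Parse the replacement so <mark> becomes a real tag, then swap it in;
    # this keeps the angle brackets from being escaped on output.
    node.replace_with(BeautifulSoup(marked, 'html.parser'))

print(soup)
```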