Remove ads from webpage code
I have lists of ad-blocking rules (example).
How can I apply them to a web page? I download the page's code with MechanicalSoup (which is based on BeautifulSoup). I would like to keep the result as a BeautifulSoup object, although an etree would also work.
I tried to use the following code, but on some pages it fails with:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
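For reference, this ValueError comes from lxml refusing a str that still carries an XML encoding declaration; a common workaround, independent of MechanicalSoup, is to hand the parser bytes instead (a minimal sketch):

```python
import lxml.html

# a page whose serialized source starts with an XML declaration
html = '<?xml version="1.0" encoding="utf-8"?><html><body><p>hi</p></body></html>'

# passing the str directly raises the ValueError above;
# re-encoding to bytes lets lxml handle the declaration itself
tree = lxml.html.document_fromstring(html.encode('utf-8'))
print(tree.findtext('.//p'))
```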
So I came up with this solution:
ADBLOCK_RULES = ['https://easylist-downloads.adblockplus.org/ruadlist+easylist.txt',
                 'https://filters.adtidy.org/extension/chromium/filters/1.txt']

for rule in ADBLOCK_RULES:
    r = requests.get(rule)
    with open(rule.rsplit('/', 1)[-1], 'wb') as f:
        f.write(r.content)

browser = mechanicalsoup.StatefulBrowser(
    soup_config={'features': 'lxml'},
    raise_on_404=True
)
response = browser.open(url)
webpage = browser.get_current_page()
html_code = re.sub(r'\n+', '\n', str(webpage))

remover = AdRemover(*[rule.rsplit('/', 1)[-1] for rule in ADBLOCK_RULES])
tree = lxml.html.document_fromstring(html_code)
adblocked = remover.remove_ads(tree)
webpage = BeautifulSoup(ElementTree.tostring(adblocked).decode(), 'lxml')
Note: for the last lines to work, remove_ads() has to be updated to return tree.
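That in-place update can be illustrated with a stand-in cleaner (a sketch only; the real AdRemover from the linked issue matches rules from the downloaded filter lists and is not on PyPI):

```python
import lxml.html
from lxml.etree import tostring

def remove_ads(tree):
    # stand-in for AdRemover.remove_ads(): drop every element whose
    # class attribute is "ad", mutating the tree in place
    for el in tree.xpath('//*[@class="ad"]'):
        el.getparent().remove(el)
    return tree  # returning the tree, as the note above suggests

html = b'<html><body><p>text</p><div class="ad">banner</div></body></html>'
adblocked = remove_ads(lxml.html.document_fromstring(html))
clean = tostring(adblocked).decode()
print(clean)
```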
This is almost the same code as in Nikita's answer, but I wanted to share it with all the imports and without the mechanicalsoup dependency, for anyone who wants to try it.
from lxml.etree import tostring
import lxml.html
import requests

# take AdRemover code from here:
# https://github.com/buriy/python-readability/issues/43#issuecomment-321174825
from adremover import AdRemover

url = 'https://google.com'  # replace it with a url you want to apply the rules to

rule_urls = ['https://easylist-downloads.adblockplus.org/ruadlist+easylist.txt',
             'https://filters.adtidy.org/extension/chromium/filters/1.txt']
rule_files = [url.rpartition('/')[-1] for url in rule_urls]

# download files containing rules
for rule_url, rule_file in zip(rule_urls, rule_files):
    r = requests.get(rule_url)
    with open(rule_file, 'w') as f:
        print(r.text, file=f)

remover = AdRemover(*rule_files)

html = requests.get(url).text
document = lxml.html.document_fromstring(html)
remover.remove_ads(document)
clean_html = tostring(document).decode("utf-8")
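Since the question asked for a BeautifulSoup object, the cleaned string converts straight back (a sketch; the literal here is a stand-in for the clean_html produced above):

```python
from bs4 import BeautifulSoup

clean_html = '<html><body><p>no ads here</p></body></html>'  # stand-in for the result above
soup = BeautifulSoup(clean_html, 'lxml')
print(soup.p.get_text())
```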