Remove unnecessary tag content from HTML source using Scrapy
I am using Scrapy to extract the HTML source of a web page and save the output in .xml format. The page source looks like this:
<html>
<head>
<script type="text/javascript">var startTime = new Date().getTime();
</script><script type="text/javascript">var startTime = new
Date().getTime(); </script> <script type="text/javascript">
document.cookie = "jsEnabled=true";..........
...........<div style="margin: 0px">Required content</div>
</head>
</html>
I need to remove all the
<script>....</script>
tags from it while keeping the required content inside its own tags.
How can I do this with Scrapy?
I suggest using the lxml package to remove the elements.
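import lxml.etree as et
from lxml.etree import HTMLParser
from io import BytesIO

def parse(self, response):
    # response.body is bytes, so wrap it in BytesIO (Python 3)
    parser = HTMLParser(encoding='utf-8', recover=True)
    tree = et.parse(BytesIO(response.body), parser)
    # Detach every <script> element from its parent
    # (note: remove() also drops any tail text that follows the element)
    for element in tree.xpath('//script'):
        element.getparent().remove(element)
    print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())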
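The same idea can be tried outside a spider. The following is a minimal sketch, assuming a simplified version of the page from the question (the `sample` bytes and the helper name `strip_scripts` are made up for illustration). It uses `lxml.etree.strip_elements` instead of the explicit loop, with `with_tail=False` so that any text following a removed `<script>` is kept.

```python
import lxml.etree as et


def strip_scripts(html_bytes):
    """Return the HTML as a string with all <script> elements removed."""
    # recover=True lets the parser cope with malformed markup
    parser = et.HTMLParser(encoding="utf-8", recover=True)
    root = et.fromstring(html_bytes, parser)
    # Remove every <script> subtree; with_tail=False keeps trailing text
    et.strip_elements(root, "script", with_tail=False)
    return et.tostring(root, encoding="unicode")


# Hypothetical, simplified stand-in for the page in the question
sample = b"""<html><head>
<script type="text/javascript">var startTime = new Date().getTime();</script>
<script type="text/javascript">document.cookie = "jsEnabled=true";</script>
</head><body><div style="margin: 0px">Required content</div></body></html>"""

cleaned = strip_scripts(sample)
print(cleaned)
```

Inside a Scrapy callback, `response.body` could presumably be passed to such a helper in place of `sample`.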
The following code removes the div with class "1" from the markup.
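from bs4 import BeautifulSoup

markup = '<a>This is not div <div class="1">This is div 1</div><div class="2">This is div 2</div></a>'
soup = BeautifulSoup(markup, "html.parser")
# Match on the class attribute directly: a class name that starts with a digit
# is not a valid CSS identifier, so the selector 'div.1' may be rejected
for tag in soup.find_all('div', class_='1'):
    tag.decompose()  # remove the tag and its contents from the tree
print(soup)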
Output:
<a>This is not div <div class="2">This is div 2</div></a>
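Applied to the original question, the same `decompose()` approach can target `<script>` tags instead of a div. This is only a sketch under assumptions: the `sample` string is a shortened, made-up stand-in for the real page, and `remove_scripts` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def remove_scripts(markup):
    """Return the markup as a string with every <script> tag removed."""
    soup = BeautifulSoup(markup, "html.parser")
    for tag in soup.find_all("script"):
        tag.decompose()  # drops the tag together with its contents
    return str(soup)


# Hypothetical, shortened stand-in for the page in the question
sample = (
    '<div><script type="text/javascript">var x = 1;</script>'
    '<div style="margin: 0px">Required content</div></div>'
)

result = remove_scripts(sample)
print(result)
```

In a spider, `response.text` could be passed to the helper, though this adds Beautiful Soup as an extra dependency on top of Scrapy.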