Bypass em tags when extracting the contents of a class with a Parsel selector
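I am trying to extract the contents of a class. How can I extract all of the content, including the text inside the 'em' tags and the text after them? See the image below:
I tried the following, with these results:
Attempt 1: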
from selenium import webdriver
from parsel import Selector

driver = webdriver.Chrome(options=options)
sel = Selector(text = driver.page_source)
sel.xpath("//*[@class ='st']").extract()
Output 1:
>> <span class="st"><span class="f">Nov 26, 2018 - </span>First #<em>GDPR fine</em> awarded in Germany. 330,000 user data stolen. Usernames and passwords stored in plaintext. €20,000 <em>fine</em>. Why "so low"?</span>
Attempt 2:
driver = webdriver.Chrome(options=options)
sel = Selector(text = driver.page_source)
sel.xpath("//*[@class ='st']/text()").extract()
Output 2:
>> First #
Ideally, the output I want is:
>> Nov 26, 2018 - First #GDPR fine awarded in Germany. 330,000 user data stolen. Usernames and passwords stored in plaintext. €20,000 fine. Why "so low"?
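I eventually found a way to solve the problem. It is not very elegant, so more elegant solutions are still welcome.
I extracted the contents of the class with: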
driver = webdriver.Chrome(options=options)
sel = Selector(text = driver.page_source)
content = sel.xpath("//*[@class ='st']").extract()
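Then I defined a function that strips the HTML from the text:
import html.parser

class HTMLTextExtractor(html.parser.HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, d):
        # Called for each run of text between tags.
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(markup):
    # The parameter is named markup so it does not shadow the html module.
    s = HTMLTextExtractor()
    s.feed(markup)
    return s.get_text()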
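As a quick check of this approach, here is a minimal, self-contained sketch that runs the same idea against the markup from Output 1. The names `TextOnlyParser`, `strip_tags`, and `sample` are hypothetical and chosen only for this illustration. The call to `close()` is my addition: it asks the parser to process any text it may still be buffering.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Receives each run of text found between tags.
        self.parts.append(data)


def strip_tags(markup):
    parser = TextOnlyParser()
    parser.feed(markup)
    parser.close()  # process any text still buffered by the parser
    return "".join(parser.parts)


# The markup shown in Output 1, split across lines for readability.
sample = (
    '<span class="st"><span class="f">Nov 26, 2018 - </span>First #'
    '<em>GDPR fine</em> awarded in Germany. 330,000 user data stolen. '
    'Usernames and passwords stored in plaintext. €20,000 <em>fine</em>. '
    'Why "so low"?</span>'
)

text = strip_tags(sample)
```

After this runs, `text` holds the single line given above as the desired output, with the date from the nested span and the words from both 'em' tags in place.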
Iterating over the list and stripping the HTML from one item at a time gave me the result I wanted:
m = []
for w in content:
    z = html_to_text(w)
    m.append(z)