How to select text without the HTML markup
I'm working on a web scraper (in Python), so I have a chunk of HTML that I'm trying to extract text from. One of the snippets looks something like this:
<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>
I would like to extract the text from this class. Right now, I could use something along the lines of
//p[@class='something']//text()
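but this results in each chunk of text ending up as a separate result element, like this:
('This class has some ', 'text', ' and a few ', 'links', ' in it.')
The desired output would contain all the text in a single element, like this: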
This class has some text and a few links in it.
Is there a simple or elegant way to achieve this?
Edit: Here is the code that produces the result given above.
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']//text()"
tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))
You can use normalize-space() in XPath. Then
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"
tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))
will produce
This class has some text and a few links in it.
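One caveat worth illustrating: in XPath 1.0, a string function such as normalize-space() that is handed a node-set only looks at the first node in document order. The sketch below assumes a page with more than one matching paragraph; the two-paragraph document and the names html_doc, first_only and per_paragraph are made up for illustration and are not from the question.

```python
from lxml import html

# Hypothetical document with two matching paragraphs (not from the original post).
html_doc = (
    '<div>'
    '<p class="something">First <strong>bold</strong>   one.</p>'
    '<p class="something">Second <a href="http://www.example.com">link</a> here.</p>'
    '</div>'
)
tree = html.fromstring(html_doc)

# Applied to a node-set, normalize-space() uses only the first node.
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Evaluating it relative to each element gives one string per paragraph.
per_paragraph = [
    p.xpath("normalize-space(.)") for p in tree.xpath("//p[@class='something']")
]

print(first_only)
print(per_paragraph)
```

So if the scraper needs every matching paragraph, selecting the elements first and calling normalize-space(.) on each one is probably the safer pattern.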
Instead of getting the text with XPath, you can call .text_content() on the lxml element.
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']"
tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))
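Unlike normalize-space(), .text_content() returns the text as it appears in the markup, including line breaks and indentation. A minimal sketch of tidying that up, assuming the source HTML is pretty-printed across several lines (the multi-line snippet and the names raw and collapsed are illustrative, not from the question):

```python
from lxml import html

# Hypothetical pretty-printed variant of the snippet from the question.
html_snippet = (
    '<p class="something">\n'
    '  This class has some\n'
    '  <strong>text</strong> and a few\n'
    '  <a href="http://www.example.com">links</a> in it.\n'
    '</p>'
)
element = html.fromstring(html_snippet)

# text_content() concatenates all descendant text, whitespace included.
raw = element.text_content()

# str.split() with no arguments splits on any whitespace run, so
# rejoining with single spaces collapses newlines and indentation.
collapsed = " ".join(raw.split())

print(repr(raw))
print(collapsed)
```

The split-and-join step is plain Python rather than anything lxml-specific, so it works the same on the output of any of the approaches in this thread.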
An alternative one-liner for your original code: use join with an empty string separator:
print("".join(query_results))
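For completeness, here is a self-contained sketch of the join approach using the snippet from the question. It also shows why the separator is empty: the text nodes already carry their own leading and trailing spaces, so joining with a space would double them. The names joined and spaced are just for illustration.

```python
from lxml import html

html_snippet = (
    '<p class="something">This class has some <strong>text</strong> '
    'and a few <a href="http://www.example.com">links</a> in it.</p>'
)
tree = html.fromstring(html_snippet)

# Each text node comes back as a separate list item.
query_results = tree.xpath("//p[@class='something']//text()")

# Empty separator: the nodes already include the surrounding spaces.
joined = "".join(query_results)

# For comparison, a space separator doubles the spaces at each boundary.
spaced = " ".join(query_results)

print(joined)
print(spaced)
```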