Parsing an HTML tree in lxml: how can I retrieve the text inside the element?
I am trying to retrieve the correct text inside an element. Here is the output:
(Pdb) p etree.tostring(els[0])
'<h5 class="msg-delivered" style="padding:0;text-rendering:optimizeLegibility;line-height:1.1;margin-bottom:15px;-webkit-font-smoothing:antialiased;font-family:"Open Sans", "Helvetica Neue", Arial, Helvetica, sans-serif;color:#888888;vertical-align:middle;margin:0;font-size:13px;font-weight:300 !important"> \n<i class="ic-icon-delivered" style="margin:0;padding:0;font-family:"Open Sans", "Helvetica Neue", "Helvetica", Helvetica, Arial, sans-serif;text-rendering:optimizeLegibility;position:relative;background:url(https://d1s8987jlndkbs.cloudfront.net/assets/sprite-ratings-ee0696744f54df6536179c70e24217e3.png) no-repeat -12px -12px;background-size:132px 436px;display:none;vertical-align:middle;width:25px;height:25px;background-position:-16px -16px;top:0"/> \nYour order was delivered \non \n6/4 \n@ \n4:44 PM \n</h5> \n'
(Pdb) p els[0].text
'\r\n'
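How can I get the string "Your item was delivered on 6/4 at 4:40 PM"? I could run a regular expression over the etree.tostring() output, but I would like to know why els[0].text does not work.
You can try the XPath function string(), which returns the concatenated values of all text nodes inside the current element: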
import lxml.html
# Quotes nested inside the style attributes are escaped as &quot; so that the strict XML parser accepts them.
html = """<h5 class="msg-delivered" style="padding:0;text-rendering:optimizeLegibility;line-height:1.1;margin-bottom:15px;-webkit-font-smoothing:antialiased;font-family:&quot;Open Sans&quot;, &quot;Helvetica Neue&quot;, Arial, Helvetica, sans-serif;color:#888888;vertical-align:middle;margin:0;font-size:13px;font-weight:300 !important"> \n<i class="ic-icon-delivered" style="margin:0;padding:0;font-family:&quot;Open Sans&quot;, &quot;Helvetica Neue&quot;, &quot;Helvetica&quot;, Helvetica, Arial, sans-serif;text-rendering:optimizeLegibility;position:relative;background:url(https://d1s8987jlndkbs.cloudfront.net/assets/sprite-ratings-ee0696744f54df6536179c70e24217e3.png) no-repeat -12px -12px;background-size:132px 436px;display:none;vertical-align:middle;width:25px;height:25px;background-position:-16px -16px;top:0"/> \nYour order was delivered \non \n6/4 \n@ \n4:44 PM \n</h5>"""
tree = lxml.html.etree.fromstring(html)
print(tree.xpath("string()"))
Output:
'\r\n\r\nYour order was delivered\r\non\r\n6/4\r\n@\r\n4:44 PM\r\n'
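The string returned by string() still carries the line breaks from the markup. If you want a single line, here is a minimal sketch of two ways to collapse the whitespace. It assumes a simplified version of the markup (the style attributes are left out for brevity), and the variable names are only illustrative:

```python
import lxml.html

# Simplified stand-in for the <h5> in the question (style attributes omitted).
html = (
    '<h5 class="msg-delivered">\n'
    '<i class="ic-icon-delivered"></i>\n'
    'Your order was delivered\non\n6/4\n@\n4:44 PM\n'
    '</h5>'
)

tree = lxml.html.fromstring(html)

# XPath normalize-space() trims the ends and collapses each run of
# whitespace into a single space.
one_line = tree.xpath("normalize-space()")

# The same clean-up done in plain Python on the string() result.
also_one_line = " ".join(tree.xpath("string()").split())

print(one_line)
```

Both variables should end up holding "Your order was delivered on 6/4 @ 4:44 PM". Note that the markup uses "@", so getting "at" as in the question would need an extra replace step.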
If you want all of the text, you can simply use:
els[0].text_content()
That is, assuming you loaded the HTML like this:
import lxml.html
html = """<h5 class="msg-delivered" style="padding:0;text-rendering:optimizeLegibility;line-height:1.1;margin-bottom:15px;-webkit-font-smoothing:antialiased;font-family:"Open Sans", "Helvetica Neue", Arial, Helvetica, sans-serif;color:#888888;vertical-align:middle;margin:0;font-size:13px;font-weight:300 !important"> \n<i class="ic-icon-delivered" style="margin:0;padding:0;font-family:"Open Sans", "Helvetica Neue", "Helvetica", Helvetica, Arial, sans-serif;text-rendering:optimizeLegibility;position:relative;background:url(https://d1s8987jlndkbs.cloudfront.net/assets/sprite-ratings-ee0696744f54df6536179c70e24217e3.png) no-repeat -12px -12px;background-size:132px 436px;display:none;vertical-align:middle;width:25px;height:25px;background-position:-16px -16px;top:0"/> \nYour order was delivered \non \n6/4 \n@ \n4:44 PM \n</h5>"""
tree = lxml.html.fromstring(html)
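Note that you may want to avoid lxml.html.etree.fromstring and just use lxml.html.fromstring. The former is the strict XML parser and returns plain elements without text_content(); the latter uses the lenient HTML parser and returns HtmlElement objects.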
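As for why els[0].text comes back nearly empty: in lxml's element model, .text holds only the text between an element's opening tag and its first child. Text that follows a child element is stored on that child's .tail. In the question, the only thing before the <i> child is a line break, and the message itself sits in the tail of <i>. A small sketch with simplified, made-up markup (not the exact HTML from the question) shows the split:

```python
import lxml.html

# Hypothetical, simplified markup: some text, a child element, more text.
html = (
    '<h5 class="msg-delivered">Status:'
    '<i class="ic-icon-delivered"></i>'
    ' Your order was delivered</h5>'
)

h5 = lxml.html.fromstring(html)
icon = h5[0]  # the <i> child

text_before_child = h5.text          # text before the first child only
text_after_child = icon.tail         # text that follows the <i> element
all_text = h5.text_content()         # every text node, concatenated
pieces = list(h5.itertext())         # the same text nodes, one by one

print(repr(text_before_child))
print(repr(text_after_child))
print(repr(all_text))
```

Here h5.text is just "Status:", while the rest of the message lives in icon.tail. text_content() and itertext() walk the whole subtree, which is why they return the full string.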