Why is the text of an HTML node empty with HTMLParser?
In the example below, I expect Foo as the text of the <h2> element:
from io import StringIO
from html5lib import HTMLParser
fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
''')
etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
h2 = etree.findall('.//h2')[0]
h2.text
Unfortunately, I get ''. Why?
Strangely, Foo does show up when I iterate over the text:
>>> list(h2.itertext())
['1. ', 'Foo', '¶']
>>> h2.getchildren()
[<Element 'span' at 0x7fa54c6a1bd8>, <Element 'a' at 0x7fa54c6a1c78>]
>>> [node.text for node in h2.getchildren()]
['1. ', '¶']
So where is Foo?
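I think you are looking one level too shallow in the tree. In the ElementTree model, an element's .text holds only the text before its first child; text that follows a child element is stored in that child's .tail. Here Foo comes after the <span>, so it is the tail of the <span>, not the text of the <h2>. Try this: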
from io import StringIO
from html5lib import HTMLParser
fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
''')
etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
etree.findall('.//h2')[0][0].tail
More generally, to grab the text and tail of every child, try a loop like this:
for u in etree.findall('.//h2')[0]:
    print(u.text, u.tail)
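To make the .text / .tail split concrete, here is a minimal sketch that collects only the text sitting directly inside an element. It uses the standard library's xml.etree.ElementTree on a trimmed-down copy of the <h2> fragment rather than html5lib, and own_text is a hypothetical helper name introduced for illustration, not part of either library.

```python
import xml.etree.ElementTree as ET


def own_text(element):
    """Join the text directly inside `element`, skipping text inside its children.

    Hypothetical helper: combines the element's .text with each child's .tail.
    """
    parts = [element.text or ""]
    for child in element:
        # Text that follows a child element is stored on that child as .tail.
        parts.append(child.tail or "")
    return "".join(parts).strip()


# Trimmed-down version of the <h2> from the question (an assumption made
# so the sketch runs without html5lib).
h2 = ET.fromstring(
    '<h2><span class="section-number">1. </span>\nFoo\n'
    '<a class="headerlink" href="#foo">¶</a></h2>'
)

print(repr(h2.text))     # text before the first child
print(repr(h2[0].tail))  # text after </span>, which contains Foo
print(own_text(h2))
```

The same helper should work on the tree returned by html5lib's default etree builder, since it exposes the same .text and .tail attributes, though whitespace handling may differ slightly.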
Using lxml:
fp2 = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
'''
import lxml.html
tree = lxml.html.fromstring(fp2)
for item in tree.xpath('//h2'):
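    # text_content() joins all text under <h2>, including the children's text
    target = item.text_content().strip()
    # the second line of that text is the heading itself
    print(target.split('\n')[1].strip())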
Output:
Foo