Japanese characters screwing up lxml parsing
How would I do the following in lxml?
runtime_text = node.xpath("//dl/dt[text()=u'Runtime:' or text()=u'Laufzeit:' or text()=u'再生時間:']/following-sibling::dd")[0].text.strip()
It works fine without the kanji, but as soon as I add that term, it fails:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 1498, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:52102)
File "xpath.pxi", line 295, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:151816)
File "apihelpers.pxi", line 1393, in lxml.etree._utf8 (src/lxml/lxml.etree.c:27087)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I think you want:
runtime_text = node.xpath(u"//dl/dt[text()='Runtime:' or text()='Laufzeit:' or text()='再生時間:']/following-sibling::dd")[0].text.strip()
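lxml probably doesn't understand Python's unicode literals inside the XPath string: there, the u'...' prefixes are just characters of the expression, not Python syntax. Moving the u in front of the opening quote makes the whole expression a unicode string, which is the Unicode input the error message asks for.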
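To try the fix in isolation, here is a minimal sketch. The sample markup and the "95 min" value are made up for illustration and are not from the question. It also shows an alternative that may be easier to maintain: passing the labels as XPath variables through keyword arguments to xpath(), so that no non-ASCII text appears in the expression itself.

```python
# -*- coding: utf-8 -*-
from lxml import html

# Hypothetical sample page; the real document in the question is not shown.
PAGE = u"""
<html><body>
<dl>
  <dt>再生時間:</dt><dd> 95 min </dd>
</dl>
</body></html>
"""

node = html.fromstring(PAGE)

# Option 1: the whole XPath expression is a unicode string (u before the
# opening quote), with plain quoted literals inside it.
expr = (u"//dl/dt[text()='Runtime:' or text()='Laufzeit:' "
        u"or text()='再生時間:']/following-sibling::dd")
runtime_text = node.xpath(expr)[0].text.strip()

# Option 2: keep the expression ASCII-only and pass the labels as XPath
# variables ($en, $de, $ja are names chosen for this sketch).
expr_vars = "//dl/dt[text()=$en or text()=$de or text()=$ja]/following-sibling::dd"
runtime_text_vars = node.xpath(
    expr_vars, en=u"Runtime:", de=u"Laufzeit:", ja=u"再生時間:"
)[0].text.strip()

print(runtime_text)
print(runtime_text_vars)
```

Both lookups should return the same dd text. The variable form also avoids quoting problems if a label ever contains an apostrophe.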