Set an attribute without a value with lxml
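I want:
<div data-a>
But the lxml API only seems to give me this:
<div data-a=''>
How do I get an attribute with no value?
Annoyingly, lxml represents both an empty value and a missing value as an empty string.
Setting the value to None does not help.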
In [19]: from lxml.html import fromstring, tostring
In [20]: b = fromstring('<body class="meow" data-a="haha" data-b data-x="">text-fef27e87389e466fb99b5421629323f6</body>')
In [21]: b.attrib
Out[21]: {'data-a': 'haha', 'data-x': '', 'data-b': '', 'class': 'meow'}
In [22]: b = fromstring('<body class="meow" data-a="haha" data-b data-x="">text-fef27e87389e466fb99b5421629323f6</body>')
In [23]: b.attrib
Out[23]: {'data-a': 'haha', 'data-x': '', 'data-b': '', 'class': 'meow'}
In [24]: b.attrib['data-y'] = None
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-24-1f55133e3dc4> in <module>()
----> 1 b.attrib['data-y'] = None
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._Attrib.__setitem__ (src/lxml/lxml.etree.c:58775)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:19025)()
/usr/lib/python2.7/dist-packages/lxml/etree.so in lxml.etree._utf8 (src/lxml/lxml.etree.c:26460)()
TypeError: Argument must be bytes or unicode, got 'NoneType'
tag.attrib['data-a'] = None
TypeError: Argument must be bytes or unicode, got 'NoneType'
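IMHO, lxml is showing the expected behavior here. An attribute without a value makes the XML not well-formed, and a decent XML library will not produce XML that is not well-formed:
- On attributes without a value in XML: Is an xml attribute without a value, valid?
- On the term "well-formed XML": Is there a difference between 'valid xml' and 'well formed xml'?
It looks like you are actually trying to manipulate HTML rather than XML. If so, use lxml.html instead of lxml.etree.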
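To make the well-formedness point concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml purely to stay dependency-free, and is_well_formed is a hypothetical helper name chosen for illustration, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if `markup` parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # The parser rejects anything the XML grammar does not allow.
        return False
    return True


# An attribute with an empty value is fine in XML.
print(is_well_formed('<div data-a=""></div>'))  # True
# A bare attribute name with no value is not.
print(is_well_formed('<div data-a></div>'))     # False
```

The bare form is only legal in HTML, which is why the HTML-specific API is the place to look for it.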
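You are trying to set a "boolean attribute", not to be confused with a "boolean value" (see boolean-attributes). As already pointed out in another answer, the boolean attribute syntax is not allowed in XML.
However, since it seems clear that you are trying to manipulate HTML, you can create a boolean attribute on an HTML element rather than on an XML element.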
import unittest
import lxml.html


class HtmlBooleanAttribute(unittest.TestCase):

    def test_booleanAttribute(self):
        # Python 2 syntax (print statements), matching the session above.
        # !!! BE SURE TO CREATE AN ****HTML**** ELEMENT !!!
        div = lxml.html.Element('div')

        # Set a boolean attribute; omitting the value or providing None will
        # create a boolean attribute.
        div.set('data-a')
        div.set('data-b', None)

        # Setting the value to an empty string will not give you a boolean
        # attribute
        div.set('data-c', '')

        # Set a normal attribute for comparison
        div.set('class', 'big red')

        print
        print lxml.html.tostring(div)
        print

        # Note that 'data-a' will be a zero-length string
        print 'data-a = ', div.get('data-a')
        print 'type(data-a) = ', type(div.get('data-a'))
        print 'len(data-a) = ', len(div.get('data-a'))
        print
        print 'data-c = ', div.get('data-c')
        print 'type(data-c) = ', type(div.get('data-c'))
        print 'len(data-c) = ', len(div.get('data-c'))


if __name__ == "__main__":
    #import sys;sys.argv = ['', 'Test.testName']
    unittest.main()
Output:
<div data-a data-b data-c="" class="big red"></div>
data-a =
type(data-a) = <type 'str'>
len(data-a) = 0
data-c =
type(data-c) = <type 'str'>
len(data-c) = 0
.
----------------------------------------------------------------------
Ran 1 test in 0.000s
OK
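Note that data-a and data-c both read back as zero-length strings, yet they serialize differently: data-a is written as a bare attribute, while data-c is written as data-c="".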
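For anyone on Python 3, the same idea can be sketched roughly as follows. This assumes lxml is installed and recent enough that set() on an lxml.html element accepts an omitted value, as the test above relies on; the variable names are only illustrative.

```python
import lxml.html

# Create an HTML element (not an lxml.etree one), as in the test above.
div = lxml.html.Element('div')

# Omitting the value asks lxml.html for a boolean (value-less) attribute.
div.set('data-a')

# An explicit empty string is kept as an ordinary attribute with an empty value.
div.set('data-c', '')

# encoding='unicode' makes tostring() return str instead of bytes on Python 3.
markup = lxml.html.tostring(div, encoding='unicode')
print(markup)

# Both attributes still read back as empty strings through the API.
print(repr(div.get('data-a')), repr(div.get('data-c')))
```

The serialized markup should have the same shape as the output shown above: data-a bare, data-c with an explicit empty value. The distinction exists only at serialization time, so code that needs to tell the two apart cannot do so through get().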