lxml xpath - Get all text within span tags
I am trying to scrape a page that looks like this, with 3 or more span tags per group. The goal is to get a list of dictionaries, e.g.:
{'ctl02_lblAppearanceInfo1': 'Text',
'ctl02_lblAppearanceInfo2': 'Text'}
HTML:
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE.............. </span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE..........</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE..........</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE..........</span>
I have used
tree.xpath('//span[starts-with(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl")]')
successfully, since it returns element objects that have id and text attributes. But if I run into something like this:
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText">
TEXT LINE 1
<br>TEXT LINE 2
<br>TEXT LINE 3
<br>TEXT LINE 4</span>
the text attribute only returns "TEXT LINE 1".
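For context, a likely explanation is that lxml keeps only the text before the first child element in an element's .text, while the text that follows each <br> is stored in that <br>'s .tail. A minimal sketch of this behaviour, using a shortened hypothetical sample (the id "demo" and the wrapping div are assumptions, not part of the real page):

```python
from lxml import html

# Hypothetical sample: one span whose text is split up by <br> tags.
SAMPLE = '<div><span id="demo">TEXT LINE 1<br>TEXT LINE 2<br>TEXT LINE 3</span></div>'

root = html.fromstring(SAMPLE)
span = root.xpath('//span[@id="demo"]')[0]

first_chunk = span.text                      # only the text before the first <br>
tails = [br.tail for br in span.iter('br')]  # the text that follows each <br>
all_text = span.text_content()               # every text node, concatenated

print(repr(first_chunk))
print(tails)
print(repr(all_text))
```

Note that text_content() joins the pieces without any separator, so it may not be the right choice when the line breaks matter.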
Use contains() to match the spans and text() to collect all of their text nodes.
The code is as follows:
from lxml import html
HTML = """<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE 1.............. </span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE 2..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE 3..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE 4..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE 5..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE 6..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE 7..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE 8..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE 9..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText">
TEXT LINE 10.............
<br>TEXT LINE 11.............
<br>TEXT LINE 12.............
<br>TEXT LINE 13.............</span>
"""
tree = html.fromstring(HTML)
# Select every span whose id contains the shared prefix.
text_lines = tree.xpath('//span[contains(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl")]')
results = dict()
for i, text_line in enumerate(text_lines):
    span_id = text_line.get('id')
    # .//text() yields every text node under the span, including the text after each <br>.
    span_text = [x.strip() for x in text_line.xpath('.//text()')]
    results[i] = dict(id=span_id, texts=span_text)
print(results)
Output (reformatted for readability):
{
0: {
'texts': ['TEXT HERE 1..............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1'
},
1: {
'texts': ['TEXT HERE 2..............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2'
},
2: {
'texts': ['TEXT HERE 3..............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace'
},
3: {
'texts': ['TEXT HERE 4..............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1'
},
4: {
'texts': ['TEXT HERE 5..............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2'
},
5: {
'texts': ['TEXT HERE 6..............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace'
},
6: {
'texts': ['TEXT HERE 7..............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1'
},
7: {
'texts': ['TEXT HERE 8..............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2'
},
8: {
'texts': ['TEXT HERE 9..............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace'
},
9: {
'texts': ['TEXT LINE 10.............', 'TEXT LINE 11.............', 'TEXT LINE 12.............', 'TEXT LINE 13.............'],
'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1'
}
}
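To get closer to the dictionary shape asked for in the question, the same text() approach could be combined with a shortened key. The following is only a sketch under assumptions: the sample ids are abbreviated and hypothetical, the marker "gvAttyInfo_" is assumed to precede the part of the id worth keeping, and spans_to_dict is an illustrative helper name.

```python
from lxml import html

# Hypothetical, shortened sample; the real ids carry a much longer prefix.
SAMPLE = """<div>
<span id="gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> A </span>
<span id="gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText">B<br>C</span>
<span id="gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText">D</span>
</div>"""

MARKER = "gvAttyInfo_"  # assumption: everything after this marker is the short key


def spans_to_dict(tree):
    """Map each span's short id to its stripped text chunks, joined by newlines."""
    result = {}
    # $marker is an XPath variable filled in from the keyword argument.
    for span in tree.xpath('//span[contains(@id, $marker)]', marker=MARKER):
        key = span.get("id").split(MARKER, 1)[1]
        chunks = [t.strip() for t in span.xpath(".//text()")]
        result[key] = "\n".join(c for c in chunks if c)
    return result


short = spans_to_dict(html.fromstring(SAMPLE))
print(short)
```

If two spans share the same id, as the first and last spans in the answer's sample do, the later one overwrites the earlier one in a dictionary like this, so a list of per-group dictionaries may be a better fit for the real page.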