Python Scrapy dynamic item with grouping by XPath
My page looks like this:
<div style="width:100%;" id="innerTSpec">
<table width="100%" cellpadding="0" cellspacing="0" class="PrintIE7in80PercentWidth PrintIE6in80PercentWidth">
<tr><td ></td><td class="techspecheading"> Header1</td></tr>
<tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class=""> </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute1: </td><td width="10px"></td><td class="techspecdata"> Value1 </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute2: </td><td width="10px"></td><td class="techspecdata"> Value2 </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
---> <tr><td ></td><td class="techspecheading"> <hr></td></tr>
<tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class=""> </td></tr>
<tr><td ></td><td class="techspecheading"> Header2</td></tr>
<tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class=""> </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute3: </td><td width="10px"></td><td class="techspecdata"> More Value1 </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute4: </td><td width="10px"></td><td class="techspecdata"> More Value2 </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute5: </td><td width="10px"></td><td class="techspecdata"> More Value3 </td></tr>
---> <tr><td ></td><td class="techspecheading"> <hr></td></tr>
</table>
</div>
The positions of the headers and attributes are not fixed; they change from page to page.
I am trying to get output like this:
Header1 | Header2 |...
----------------------------------------------
My Attribute1:Value1|My Attribute3:More Value1|...
My Attribute2:Value2|My Attribute4:More Value2|...
|My Attribute5:More Value3|...
Note: I am using dynamic items. My Item is defined as below:
--------------------------------------
from scrapy.item import Item, Field

class Website(Item):
    def __setitem__(self, key, value):
        # Register unknown keys as fields on the fly.
        if key not in self.fields:
            self.fields[key] = Field()
        self._values[key] = value
--------------------------------------
and in the spider I add values as below:
--------------------------------------
item[Heading]=Body.xpath('..........').extract()
I don't have Scrapy installed, but I think you can easily adapt this to use Scrapy's Items.
from lxml.html import fromstring
html = """
<div style="width:100%;" id="innerTSpec">
<table width="100%" cellpadding="0" cellspacing="0" class="PrintIE7in80PercentWidth PrintIE6in80PercentWidth">
<tr><td ></td><td class="techspecheading"> Header1</td></tr>
<tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class=""> </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute1: </td><td width="10px"></td><td class="techspecdata"> Value1 </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute2: </td><td width="10px"></td><td class="techspecdata"> Value2 </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
---> <tr><td ></td><td class="techspecheading"> <hr></td></tr>
<tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class=""> </td></tr>
<tr><td ></td><td class="techspecheading"> Header2</td></tr>
<tr><td ></td><td class="techspecdata"> </td><td width="10px"></td><td class=""> </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute3: </td><td width="10px"></td><td class="techspecdata"> More Value1 </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute4: </td><td width="10px"></td><td class="techspecdata"> More Value2 </td></tr>
<tr><td ></td><td class="techspecheading"> </td></tr>
<tr><td ></td><td class="techspecdata"> My Attribute5: </td><td width="10px"></td><td class="techspecdata"> More Value3 </td></tr>
---> <tr><td ></td><td class="techspecheading"> <hr></td></tr>
</table>
</div>
"""
body = fromstring(html)
heading = None
item = {}
for tr in body.xpath(r'//div[@id="innerTSpec"]//tr'):
    # Extract row data. Skip rows without data.
    data = tr.xpath(r'.//td[@class]/text()')
    data = list(filter(None, [txt.strip() for txt in data]))
    if not data:
        continue
    # Populate item.
    if len(data) == 1:
        heading = data[0]
    else:
        item.setdefault(heading, []).append(''.join(data))
print(item)
Resulting item (formatted for readability):
{
'Header1': ['My Attribute1:Value1', 'My Attribute2:Value2'],
'Header2': ['My Attribute3:More Value1', 'My Attribute4:More Value2', 'My Attribute5:More Value3']
}
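If the column layout from the question (one column per header, attributes listed underneath) is what you need in the end, the grouped dict can be turned into rows afterwards. This is only a sketch: the helper name to_rows is made up for illustration, the item literal is copied from the result above, and it relies on dicts keeping insertion order (Python 3.7+).

```python
from itertools import zip_longest

# Grouped result in the same shape as the dict above:
# heading -> list of "attribute:value" strings.
item = {
    "Header1": ["My Attribute1:Value1", "My Attribute2:Value2"],
    "Header2": [
        "My Attribute3:More Value1",
        "My Attribute4:More Value2",
        "My Attribute5:More Value3",
    ],
}


def to_rows(grouped):
    """Return a header row followed by data rows.

    Columns shorter than the longest one are padded with ''.
    (Hypothetical helper, not part of Scrapy or lxml.)
    """
    headers = list(grouped)
    columns = [grouped[h] for h in headers]
    rows = [headers]
    rows.extend(list(row) for row in zip_longest(*columns, fillvalue=""))
    return rows


rows = to_rows(item)
for row in rows:
    print("|".join(row))
```

zip_longest is used instead of zip so that the extra attribute under Header2 is not dropped when Header1 has fewer entries.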