Python Scrapy dynamic item with grouping by XPath

My page looks like this:

<div style="width:100%;" id="innerTSpec">
        <table width="100%" cellpadding="0" cellspacing="0" class="PrintIE7in80PercentWidth PrintIE6in80PercentWidth">
            <tr><td ></td><td  class="techspecheading">    Header1</td></tr>
            <tr><td ></td><td  class="techspecdata">    </td><td width="10px"></td><td class="">        </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute1: </td><td width="10px"></td><td class="techspecdata">    Value1    </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute2: </td><td width="10px"></td><td class="techspecdata">    Value2     </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
--->        <tr><td ></td><td  class="techspecheading">    <hr></td></tr>
            <tr><td ></td><td  class="techspecdata">    </td><td width="10px"></td><td class="">        </td></tr>
            <tr><td ></td><td  class="techspecheading">   Header2</td></tr>
            <tr><td ></td><td  class="techspecdata">    </td><td width="10px"></td><td class="">        </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute3: </td><td width="10px"></td><td class="techspecdata">    More Value1     </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute4: </td><td width="10px"></td><td class="techspecdata">    More Value2     </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">   My Attribute5: </td><td width="10px"></td><td class="techspecdata">    More Value3     </td></tr>
--->        <tr><td ></td><td  class="techspecheading">    <hr></td></tr>
            
        </table>
    </div>

The positions of the headers and attributes are not fixed; they change from page to page. I am trying to produce output like this:

Header1             | Header2                 |...
----------------------------------------------
My Attribute1:Value1|My Attribute3:More Value1|...
My Attribute2:Value2|My Attribute4:More Value2|...
                    |My Attribute5:More Value3|...

Note: I am using dynamic items, so fields are created as they are assigned.

My item is defined as below
--------------------------------------
class Website(Item):
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = Field()
        self._values[key] = value
--------------------------------------
and in the spider I add fields as below
--------------------------------------
item[Heading]=Body.xpath('..........').extract()

I don't have Scrapy installed, but I think you can easily adapt this to use Scrapy's Items:

from lxml.html import fromstring


html = """
<div style="width:100%;" id="innerTSpec">
        <table width="100%" cellpadding="0" cellspacing="0" class="PrintIE7in80PercentWidth PrintIE6in80PercentWidth">
            <tr><td ></td><td  class="techspecheading">    Header1</td></tr>
            <tr><td ></td><td  class="techspecdata">    </td><td width="10px"></td><td class="">        </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute1: </td><td width="10px"></td><td class="techspecdata">    Value1    </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute2: </td><td width="10px"></td><td class="techspecdata">    Value2     </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
--->        <tr><td ></td><td  class="techspecheading">    <hr></td></tr>
            <tr><td ></td><td  class="techspecdata">    </td><td width="10px"></td><td class="">        </td></tr>
            <tr><td ></td><td  class="techspecheading">   Header2</td></tr>
            <tr><td ></td><td  class="techspecdata">    </td><td width="10px"></td><td class="">        </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute3: </td><td width="10px"></td><td class="techspecdata">    More Value1     </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">    My Attribute4: </td><td width="10px"></td><td class="techspecdata">    More Value2     </td></tr>
            <tr><td ></td><td  class="techspecheading">    </td></tr>
            <tr><td ></td><td  class="techspecdata">   My Attribute5: </td><td width="10px"></td><td class="techspecdata">    More Value3     </td></tr>
--->        <tr><td ></td><td  class="techspecheading">    <hr></td></tr>
        </table>
    </div>
"""
body = fromstring(html)

heading = None
item = {}
for tr in body.xpath(r'//div[@id="innerTSpec"]//tr'):
    # Extract row data. Skip rows without data.
    data = tr.xpath(r'.//td[@class]/text()')
    data = list(filter(None, [txt.strip() for txt in data]))
    if not data:
        continue

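    # Populate item: a row with one non-empty cell is a heading,
    # any other non-empty row is an attribute/value pair under that heading.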
    if len(data) == 1:
        heading = data[0]
    else:
        item.setdefault(heading, []).append(''.join(data))
print(item)

Resulting item (reformatted for readability):

{
    'Header1': ['My Attribute1:Value1', 'My Attribute2:Value2'],
    'Header2': ['My Attribute3:More Value1', 'My Attribute4:More Value2', 'My Attribute5:More Value3']
}
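
To get from this dict to the side-by-side layout asked for in the question, something like the following might work. It is only a sketch using the standard library: `group_rows` and `to_columns` are hypothetical helper names, and the rows are given as plain lists of cell texts. In a spider they could presumably come from something like `tr.xpath('.//td[@class]/text()').getall()`, and the `item` argument is assumed to accept the dynamic `Website` item from the question, since only `get` and item assignment are used; neither of those was tested here.

```python
from itertools import zip_longest


def group_rows(rows, item=None):
    """Group attribute rows under the most recent heading row.

    `rows` is an iterable of lists of raw cell texts, one list per <tr>.
    `item` is any mapping supporting `get` and item assignment; a plain
    dict is used by default (passing a Scrapy Item is an assumption).
    """
    item = {} if item is None else item
    heading = None
    for cells in rows:
        data = [text.strip() for text in cells if text.strip()]
        if not data:
            continue  # spacer or <hr> row: nothing to record
        if len(data) == 1:
            heading = data[0]  # a lone non-empty cell is treated as a heading
        else:
            # Reassign rather than mutate in place, so only get/__setitem__ are needed.
            item[heading] = item.get(heading, []) + ["".join(data)]
    return item


def to_columns(grouped, sep="|"):
    """Render {heading: [values]} as left-aligned columns, one per heading."""
    headings = list(grouped)
    # Each column is as wide as its longest entry, heading included.
    widths = [max(len(v) for v in [h, *grouped[h]]) for h in headings]
    lines = [sep.join(h.ljust(w) for h, w in zip(headings, widths))]
    lines.append("-" * len(lines[0]))
    # zip_longest pads the shorter columns with empty strings.
    for row in zip_longest(*(grouped[h] for h in headings), fillvalue=""):
        lines.append(sep.join(v.ljust(w) for v, w in zip(row, widths)))
    return "\n".join(lines)


# Cell texts as they would come out of the sample table, whitespace included.
rows = [
    ["    Header1"],
    ["    ", "        "],
    ["    My Attribute1: ", "    Value1    "],
    ["    My Attribute2: ", "    Value2     "],
    ["    "],
    ["   Header2"],
    ["    My Attribute3: ", "    More Value1     "],
    ["    My Attribute4: ", "    More Value2     "],
    ["   My Attribute5: ", "    More Value3     "],
]
grouped = group_rows(rows)
print(to_columns(grouped))
```

Columns are padded to the longest entry under each heading, so headings with fewer attributes simply get blank cells at the bottom.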