Html5 find/parse 页中的特定元素 python

Html5 find/parse specific element in page python

我正在尝试学习如何 find/parse 来自 html5 网页的数据以在数据库中使用。我想学习如何 find/parse 仅来自第一个 '//div[@class="col-xs-12 col-sm-6 col-md-4 col-lg-3"]'

的数据

我试过 html5lib,从 lxml 导入 html 和 xpath 但是缺少我的特定用途的文档令人沮丧,无法真正找到如何实现这一目标。

要查找和存储的数据:

http://csgo.steamanalyst.com/id/120565/ 
from <span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/'

And the 2 numbers from "addToCart(1852864,1108)" as id1:'1852864' and id2:'1108'

in <button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'

我正在尝试学习的 html 代码

<!DOCTYPE html> 

<div class='row'><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852864'>StatTrak&#8482; Desert Eagle | Conspiracy (Factory New)</a><br /><small class='text-muted'>StatTrak&#8482; Classified Pistol</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>1,108</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>1,451</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&StatTrak=1&search_item=+Desert+Eagle+%7C+Conspiracy+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1841001'>★ Karambit | Doppler (Factory New)</a><br /><small class='text-muted'>★ Covert Knife</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>155,000</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/62403692/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>30,300</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=%E2%98%85+Karambit+%7C+Doppler+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem2' onclick='addToCart(1841001,155000)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852853'>AK-47 | Redline (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>441</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/1420/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>520</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=AK-47+%7C+Redline+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem3' onclick='addToCart(1852853,441)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852846'>M4A1-S | Master Piece (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>6,618</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120409/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>8,905</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=M4A1-S+%7C+Master+Piece+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem4' onclick='addToCart(1852846,6618)'>Add to cart</button></center></div>
    </div>

我使用了一个更简单的库来解决类似的问题:

import re
from HTMLParser import HTMLParser

class MyParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.in_market = 0
    self.markets = {}
    self.market = None

  def handle_starttag(self, tag, attrs):
    if tag == 'span':
      if "class" in attrs and \
      and attrs["class"].indexof('market-name') != -1:
        self.in_market = 1
      elif self.in_market:
        self.in_market += 1
    elif self.in_market:
      if tag == 'a' and 'href' in attrs:
        self.market = attrs["href"]
      elif tag == 'button' and 'onclick' in attrs:
        add_to_cart_RE = re.compile(r'addToCart\((\d+),(\d+)\)')
        match = add_to_cart_RE.match(attrs["onclick"])
        self.markets[self.market] = [match.group(1), match.group(2)]


  def handle_endtag(self, tag):
    if self.tag == 'span' and self.in_market:
      self.in_market -= 1

  def handle_data(self, data):
    pass

如果您不清楚代码,请向我提问。

使用 lxml 库中的 html 解析器。对于下面的工作示例,您的 HTML 已分配给 myhtml。可能有更优雅的方法来解析按钮属性中的文本,但这是一个开始。

>>> from lxml import html
>>> tree = html.fromstring(myhtml)
>>> mybuttons = tree.xpath('//button[@class="btn btn-orange" and @onclick]')
>>> len(mybuttons)
4
>>> for button in mybuttons:
...     (id1, id2) = button.attrib['onclick'].replace('(', ' ').replace(',', ' ').replace(')', ' ').split()[1:]
...     print id1, id2
... 
1852864 1108
1841001 155000
1852853 441
1852846 6618
>>> myurl = tree.xpath('//span[@class="market-name"]/a')
>>> for u in myurl:
...     href = u.attrib['href']
...     print href
... 
http://csgo.steamanalyst.com/id/120565/
http://csgo.steamanalyst.com/id/62403692/
http://csgo.steamanalyst.com/id/1420/
http://csgo.steamanalyst.com/id/120409/
>>>