BeautifulSoup seemingly randomly scrapes 23, 42 or 87 results from the page despite there being 100

Question

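EDIT/UPDATE: I started using BeautifulSoup as alecxe suggested below, but now I'm getting a seemingly random number of results. Sometimes it returns 23, most of the time 42, and sometimes 87. If I re-scrape the same page, I don't get the same results. 95% of the time it retrieves 42 items... Does anyone know what's going on? (full-size link here: http://i.imgur.com/YeeupLh.png)

I used code similar to alecxe's, shown here, but I believe both have the same problem: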

def parse(self, response):
    data = json.loads(response.body)['results_html']
    soup = BeautifulSoup(data, "lxml")

    prices = [float(price.strip(r"\r\n\t").replace('$','').split(" ")[0])
              for price in soup.find_all(text=re.compile(r"USD"))]

Part of my code can be seen here: http://pastebin.com/y7hypCmv

(Previous) For whatever reason, my scraper insists on scraping 19 results per page rather than the 100 available. Here is my spider:

from scrapy import Request, Spider
from scrapy.selector import Selector
from idem.items import IdemItem


URL = 'http://steamcommunity.com/market/search/render/?query=&start={page}&count=100' # Note, this is pre-formatted HTML

class MySpider(Spider):
    handle_httpstatus = 200
    name = "postings"

    def start_requests(self):
        index = 0
        while True:
            yield Request(URL.format(page=index))
            index +=100
            if index >= 200: break
    def parse(self,response):
        sel = Selector(response)
        items = []
        item = IdemItem()
        item["price"] = sel.xpath("//text()[contains(.,'$')]").extract()
        item["supply"] = sel.xpath("//span[@class[contains(.,'market_listing_num')]]/text()").extract()
        item["_id"] = sel.xpath("//span[@class[contains(.,'market_listing_item_name')]]/text()[1]").extract()

        for price, supply, _id in zip(item["price"], item["supply"], item["_id"]):
            item = IdemItem()
            item["price"] = float(price.strip(r"\r\n\t").replace('$',''))
            item["supply"] = int(supply.strip(r"\r\n\t").replace(',',''))
            item["_id"] = _id.strip(r"\r\n\t").replace(r'u2605','\u2605').decode('unicode-escape')
            items.append(item)
        return items
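
One detail of `parse` worth checking when fewer items come out than expected: `zip` stops at its shortest input, so if any of the three XPath result lists is shorter than the others, the extra entries are dropped without an error. A minimal sketch with made-up values (the lists and the `zip_strict` helper are hypothetical and not part of the spider):

```python
def zip_strict(*columns):
    """Zip equally long lists; raise instead of silently truncating."""
    lengths = [len(col) for col in columns]
    if len(set(lengths)) > 1:
        raise ValueError("column lengths differ: %r" % (lengths,))
    return list(zip(*columns))


# Made-up data: three prices but only two supply counts and two names.
prices = ["$0.08", "$0.04", "$0.11"]
supplies = ["1,204", "87"]
names = ["Item A", "Item B"]

# Plain zip quietly yields two rows and drops the third price.
rows = list(zip(prices, supplies, names))
print(len(rows))

# The strict variant surfaces the mismatch instead.
try:
    zip_strict(prices, supplies, names)
except ValueError as exc:
    print(exc)
```

Comparing the three list lengths before zipping would show whether one of the XPath expressions matches fewer nodes than the others.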

If I change to count=19 and index += 19, I can extract all the data, but I would rather scrape all 100 listings at once!

Here are the shell results after the crawl:
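
The `start`/`count` paging used by the spider can be sketched as a small helper that yields one URL per page. The helper name `page_urls` and its `total` argument are assumptions for illustration; the URL is the one from the spider with `count` turned into a placeholder.

```python
URL = ('http://steamcommunity.com/market/search/render/'
       '?query=&start={start}&count={count}')


def page_urls(total, count):
    """Yield one URL per page of `count` results, covering `total` results."""
    for start in range(0, total, count):
        yield URL.format(start=start, count=count)


# Two pages of 100 results, matching the spider's start_requests loop.
for url in page_urls(total=200, count=100):
    print(url)
```

Keeping the page size in one argument means that switching between 19 and 100 results per page changes the step and the `count` parameter together.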

  'downloader/request_count': 2,
  'downloader/request_method_count/GET': 2,
  'downloader/response_bytes': 31456,
  'downloader/response_count': 2,
  'downloader/response_status_count/200': 2,
  'finish_reason': 'finished',
  'finish_time': datetime.datetime(2015, 4, 10, 16, 13, 46, 409000),
  'item_scraped_count': 38, #-----(19 results x 2 pages)-------#
  'log_count/DEBUG': 80,
  'log_count/INFO': 7,
  'response_received_count': 2,
  'scheduler/dequeued': 2,
  'scheduler/dequeued/memory': 2,
  'scheduler/enqueued': 2,
  'scheduler/enqueued/memory': 2,
  'start_time': datetime.datetime(2015, 4, 10, 16, 13, 44, 808000)}

Really, any suggestions would help!

Answer 1

Here is what worked for me (requires BeautifulSoup to be installed):

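import json
import re
from bs4 import BeautifulSoup

def parse(self, response):
    # The endpoint returns JSON; the listing markup is in 'results_html'.
    data = json.loads(response.body)['results_html']
    soup = BeautifulSoup(data, "lxml")

    # Each price is a text node like "$0.08 USD"; keep only the number.
    prices = [float(price.strip().replace('$', '').split(" ")[0])
              for price in soup.find_all(text=re.compile(r"USD"))]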

prices would then be a list like this:

[0.08, 0.08, 0.04, 0.08, 0.05, 0.11, 0.08, 0.03, 0.06, 0.07, 0.06, 0.06, 0.11, 0.07, 0.08, 0.08, 0.07, 0.07, 0.08, 0.08, 0.12, 0.08, 0.07, 0.11,
 ...
 0.04, 0.11, 0.04, 0.04, 0.04, 0.06, 0.04, 0.09, 0.06, 0.12, 0.04, 0.06, 0.07, 0.04, 0.05, 0.04]

FYI, I have tried different locating techniques but haven't gotten it to work with Scrapy alone.
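
A note on the `strip(r"\r\n\t")` calls in the question's code: with a raw string, `str.strip` removes the literal characters backslash, `r`, `n` and `t` from both ends rather than whitespace. The conversion still succeeds because `float()` ignores surrounding whitespace on its own. A small sketch, assuming price text shaped like `$0.08 USD` padded with whitespace (the sample string and the `parse_price` helper are made up for illustration):

```python
def parse_price(text):
    """Turn whitespace-padded text such as '$0.08 USD' into the float 0.08."""
    cleaned = text.strip()  # no argument: strips all surrounding whitespace
    return float(cleaned.replace('$', '').split(' ')[0])


sample = '\r\n\t\t$0.08 USD\t\t'  # made-up sample text

# Raw string: the strip set is the characters \, r, n and t,
# so the real whitespace around the sample is left in place.
print(sample.strip(r"\r\n\t") == sample)

# Plain strip() removes the surrounding whitespace.
print(repr(sample.strip()))

print(parse_price(sample))
```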

Answer 2

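Figured it out.

For whatever reason, the HTML parser BeautifulSoup was using got confused and did not return the correct number of results.

I solved it by switching to Python's built-in HTML parser, which consistently returns 100 results.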

soup = BeautifulSoup(data, "html.parser")
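
To cross-check a count without going through BeautifulSoup at all, the same standard-library parser can be driven directly. The sketch below counts text nodes containing `USD`; the `PriceCounter` class and the sample HTML are made up for illustration and are not the real page markup.

```python
from html.parser import HTMLParser


class PriceCounter(HTMLParser):
    """Count text nodes that contain the marker string 'USD'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_data(self, data):
        if 'USD' in data:
            self.count += 1


# Made-up fragment standing in for the 'results_html' value.
sample_html = (
    '<div><span class="normal_price">$0.08 USD</span></div>'
    '<div><span class="normal_price">$0.04 USD</span></div>'
    '<div><span class="normal_price">$0.11 USD</span></div>'
)

counter = PriceCounter()
counter.feed(sample_html)
counter.close()
print(counter.count)
```

If this count and the length of the BeautifulSoup result disagree for the same input string, the difference lies in the parser rather than in the download.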


python

xpath

beautifulsoup

scrapy