Scraping data from URLs: how to retrieve all the URL pages with missing and unknown final page IDs

Question

I want to pull data from a set of web pages.

Here is an example URL:

http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id=3&listname=

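My problems are:

The 'id=' number in the URL changes from page to page.
I want to loop through and retrieve all of the pages in the database.
Some ids may be missing (for example, there may be pages for id=3 and id=6, but none for id=4 and id=5).
I don't know the final id number (for example, the last page in the database could be id=100000 or id=1000000000; I have no idea).

I know the general couple of lines of code I need: somehow make a list of numbers, then loop through the numbers with this code to pull down the text of each page (parsing the text itself is a job for another day):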

import urllib2
from bs4 import BeautifulSoup

id_name = "3"  # example value; in the real loop this comes from the list of ids
web_page = "http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id=" + id_name + "&listname="
page = urllib2.urlopen(web_page)
soup = BeautifulSoup(page, 'html.parser')

Can anyone advise on the best way to 'take all of the pages', given the two problems I face: missing pages, and not knowing which page is the last one?

Answer 1

To get the possible pages, you can do something like this (my example is in Python 3):

import re
from urllib.request import urlopen
from lxml import html

ITEMS_PER_PAGE = 50

base_url = 'http://www.signalpeptide.de/index.php'
url_params = '?sess=&m=listspdb_mammalia&start={}&orderby=id&sortdir=asc'


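def get_pages(total):
    # Values for the `start` parameter, stepping by ITEMS_PER_PAGE.
    pages = list(range(ITEMS_PER_PAGE, total, ITEMS_PER_PAGE))
    # range() stops before `total`, so add it as the final value.
    # This also avoids an IndexError when total <= ITEMS_PER_PAGE.
    if not pages or pages[-1] < total:
        pages.append(total)
    return pages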

def generate_links():
    # Fetch one listing page and read the total record count from its pagination text.
    start_url = base_url + url_params.format(ITEMS_PER_PAGE)
    page = urlopen(start_url).read()
    dom = html.fromstring(page)
    xpath = '//div[@class="content"]/table[1]//tr[1]/td[3]/text()'
    pagination_text = dom.xpath(xpath)[0]
    # The count is the token that follows "of" in the pagination text.
    total = int(re.findall(r'of\s(\w+)', pagination_text)[0])
    print(f'Number of records to scrape: {total}')
    pages = get_pages(total)
    # Lazily build one listing URL per start value.
    links = (base_url + url_params.format(i) for i in pages)
    return links

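Basically, this fetches the first listing page and reads the total number of records from its pagination text. Assuming each page holds 50 records, get_pages() computes the values to pass to the start parameter, and generate_links() builds all the pagination URLs from them.

You then need to fetch each of those pages, iterate over the table of proteins, and follow each row to its detail page to get the information you need, using BeautifulSoup or lxml with XPath. Because the ids come from the listing itself, missing ids and the unknown last id stop being a problem.

I tried fetching all of the pages concurrently with asyncio, but the server timed out :). Hope my functions help!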
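The "iterate over the table and go to the detail page" step could look roughly like the sketch below. It is only an illustration under assumptions that have not been checked against the site: that each listing row links to its detail page with an href carrying s=details and a numeric id, as in the question's example URL. The names DetailIdParser, extract_detail_ids, collect_ids and the fetch argument are hypothetical helpers, not part of the answer's code. Pages are fetched one at a time with a pause, since concurrent requests reportedly timed out.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class DetailIdParser(HTMLParser):
    """Collects the numeric `id` of every link that points at a details page."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        query = parse_qs(urlparse(href).query)
        # Assumption: detail links look like the question's URL (s=details&id=<number>).
        id_values = query.get('id', [''])
        if query.get('s') == ['details'] and id_values[0].isdigit():
            self.ids.append(int(id_values[0]))


def extract_detail_ids(listing_html):
    """Return the detail-page ids linked from one listing page, in document order."""
    parser = DetailIdParser()
    parser.feed(listing_html)
    return parser.ids


def collect_ids(links, fetch, delay=1.0):
    """Fetch listing pages sequentially and return the sorted set of ids seen.

    `fetch` is any callable that maps a URL to the page's HTML as a str.
    """
    seen = set()
    for link in links:
        seen.update(extract_detail_ids(fetch(link)))
        time.sleep(delay)  # pause between requests to go easy on the server
    return sorted(seen)
```

With the answer's code, this might be called as collect_ids(generate_links(), fetch), where fetch wraps urlopen(url).read() and decodes the bytes using whatever encoding the site actually serves. The resulting ids can then be substituted into the details URL from the question.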

