Web Scraping Unordered List Issue

I'm reading a book called "Learn Python by Building Data Science Applications", and one of its chapters covers web scraping, which I fully admit I haven't played with before. I've reached the section that discusses unordered lists and how to work with them, and my code produces an error that doesn't make sense to me:

Traceback (most recent call last):
  File "/Users/gillian/100-days-of-code/Learn-Python-by-Building-Data-Science-Applications/Chapter07/wiki2.py", line 77, in <module>
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
IndexError: list index out of range

My first thought was that there simply weren't any unordered lists on the page anymore, but I checked, and... there are. My reading of the error is that it isn't returning the list, but I'm having a hard time figuring out how to test that, and I fully admit recursion makes my head spin; it's not my strongest area.
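For what it's worth, this is the kind of check I think I'm after (a sketch against a toy page shaped like the article, since I wasn't sure how to poke at the live one): print what each search returns before indexing into it with `[0]`.

```python
from bs4 import BeautifulSoup

# Toy page shaped like the article: an h2 header followed by a div of battles
html = '<h2>African Front</h2><div class="div-col"><ul><li>North African campaign</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

# Print what each search returns before indexing into it with [0]
for front in soup.find_all("h2"):
    siblings = front.find_next_siblings("div", "div-col columns column-width")
    print(front.text, "->", len(siblings), "matching siblings")
```

This prints `0 matching siblings`, which seems to confirm my interpretation that the search comes back empty, but I don't understand why.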

My full code is attached below (including the notes I made for myself, hence the large number of comments):

'''scrapes list of WWII battles'''
import requests as rq

base_url = 'https://en.wikipedia.org/wiki/List_of_World_War_II_battles'
response = rq.get(base_url)

'''access the raw content of a page with response.content'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')

'''3 ways to search for an element:
    1. find
    2. find_all
    3. select

for 1 and 2 you pass an object type and attributes, maybe,

a recursive argument defines if the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string

this makes select easier to use, sometimes
'''
content = soup.select('div#mw-content-text > div.mw-parser-output', limit=1)[0]

'''
collect corresponding elements for each front, which are all h2 headers

all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections

last title is citations and notes

one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable

'''
fronts = content.select('div.mw-parser-output>h2')[:-1]

for el in fronts:
    print(el.text[:-6])

'''getting the corresponding ul lists for each header

bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element

to get this all simultaneously, we'll need to use recursion
'''

def dictify(ul, level=0):
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(':', '').strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find('a')
        if link:
            link = _abs_link(link.get('href'))
        r ={'url': link,
            'time':time,
            'level': level}
        if ul:
            r['children'] = dictify(ul, level=(level + 1))
        result[key] = r
    return result

theaters = {}

for front in fronts:
    list_element = front.find_next_siblings("div", "div-col columns column-width")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)

If anyone has any input on how I can move forward with this, I'd be very grateful. Thanks.

The error means `.find_next_siblings` didn't find anything. Try changing it to `front.find_next_siblings("div", "div-col")`. Also, `_abs_link()` isn't defined anywhere, so I removed it:

"""scrapes list of WWII battles"""
import requests as rq

base_url = "https://en.wikipedia.org/wiki/List_of_World_War_II_battles"
response = rq.get(base_url)

"""access the raw content of a page with response.content"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, "html.parser")


def get_dom(url):
    response = rq.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


"""3 ways to search for an element:
    1. find
    2. find_all
    3. select

for 1 and 2 you pass an object type and attributes, maybe,

a recursive argument defines if the search should be recursive
First method retrieves first occurrence
Second method will always return a list with all elements
select will return a list and expects you to pass a single CSS selector string

this makes select easier to use, sometimes
"""
content = soup.select("div#mw-content-text > div.mw-parser-output", limit=1)[0]

"""
collect corresponding elements for each front, which are all h2 headers

all fronts are sections - each with a title in h2 but hierarchically the titles are not nested within the sections

last title is citations and notes

one way is to just drop the last element or we can use a CSS Selector trick, which is to specify :not(:last-of-type) but that is less readable

"""
fronts = content.select("div.mw-parser-output>h2")[:-1]

for el in fronts:
    print(el.text[:-6])

"""getting the corresponding ul lists for each header

bs4 has a find_next_siblings method that works like find_all except that it will look in the document after each element

to get this all simultaneously, we'll need to use recursion
"""


def dictify(ul, level=0):
    result = dict()
    for li in ul.find_all("li", recursive=False):
        text = li.stripped_strings
        key = next(text)
        try:
            time = next(text).replace(":", "").strip()
        except StopIteration:
            time = None
        ul, link = li.find("ul"), li.find("a")
        if link:
            link = link.get("href")
        r = {"url": link, "time": time, "level": level}
        if ul:
            r["children"] = dictify(ul, level=(level + 1))
        result[key] = r
    return result


theaters = {}

for front in fronts:
    list_element = front.find_next_siblings("div", "div-col")[0].ul
    theaters[front.text[:-6]] = dictify(list_element)

print(theaters)

Prints:

{
    "African Front": {
        "North African campaign": {
            "url": "/wiki/North_African_campaign",
            "time": "June 1940 - May 1943",
            "level": 0,
            "children": {
                "Western Desert campaign": {
                    "url": "/wiki/Western_Desert_campaign",
                    "time": "June 1940 – February 1943",
                    "level": 1,
                    "children": {
                        "Italian invasion of Egypt": {
                            "url": "/wiki/Italian_invasion_of_Egypt",
                            "time": "September 1940",
                            "level": 2,
                        },
                        "Operation Compass": {
                            "url": "/wiki/Operation_Compass",
                            "time": "December 1940 – February 1941",
                            "level": 2,
                            "children": {
                                "Battle of Nibeiwa": {
                                    "url": "/wiki/Battle_of_Nibeiwa",
                                    "time": "December 1940",
                                    "level": 3,
                                },

...and so on.
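A side note on why the original string fails (a toy reproduction, not the live page): as I understand BeautifulSoup's class matching, a space-separated string like `"div-col columns column-width"` only matches when the element's `class` attribute is exactly that string, while a single class name matches any element that includes it among its classes. Wikipedia's markup evidently no longer carries the exact three-class attribute the book used.

```python
from bs4 import BeautifulSoup

# Minimal HTML mimicking the article's layout: an h2 followed by a div of battles
html = '<h2>African Front</h2><div class="div-col"><ul><li>North African campaign</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# A multi-class string only matches an exact class-attribute value
old_style = h2.find_next_siblings("div", "div-col columns column-width")
# A single class name matches any element carrying that class
new_style = h2.find_next_siblings("div", "div-col")

print(len(old_style), len(new_style))  # 0 1
```

Because of that, the single-class search is also the more robust choice going forward: it keeps working even if Wikipedia adds or removes other classes on those divs.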