HTML confusion with Python BeautifulSoup

Question

I'm following thenewboston's tutorial on YouTube, and I get no errors when I run my code.

I'm trying to print the "Generic Line List" and all the links that follow it; the list is at the bottom of this page: http://playrustwiki.com/wiki/List_of_Items

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages: #makes our pages change everytime
        url = 'http://playrustwiki.com/wiki/List_of_Items' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text) #find all the links in soup or all the titles
        for link in soup.findAll('a', {'class': 'a href'}): # links are <a> (anchor) tags in HTML
            href = link.get('href') # href attribute
            print(href)
        page += 1

trade_spider(1)

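I've tried different HTML attributes, but I think this is where I start to get confused. Either I can't find the right attribute for my scraper to target, or I'm targeting the wrong one.

Please help!

Thanks :)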

Answer 1

The idea here is to find the h3 element with the text Generic Line List, then get the next ul sibling via find_next_sibling() and collect all the links inside it via find_all():

h3 = soup.find('h3', text='Generic Line List')
generic_line_list = h3.find_next_sibling('ul')
for link in generic_line_list.find_all('a', href=True):
    print(link['href'])
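
To try the same approach without hitting the network, here is a minimal sketch that runs against an inline HTML fragment. The fragment, the helper name links_after_heading, and the explicit "html.parser" argument are assumptions for illustration, not part of the original page or answer. It also uses string=, which newer versions of BeautifulSoup accept in place of text=.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the structure assumed by the answer:
# an <h3> heading immediately followed by a <ul> of links.
SAMPLE_HTML = """
<h3>Other List</h3>
<ul><li><a href="/wiki/Other">Other</a></li></ul>
<h3>Generic Line List</h3>
<ul>
  <li><a href="/wiki/Wood_Barricade">Wood Barricade</a></li>
  <li><a href="/wiki/Wood_Shelter">Wood Shelter</a></li>
  <li><a name="no-href">Anchor without href</a></li>
</ul>
"""

def links_after_heading(html, heading_text):
    """Return the href of every link in the <ul> that follows the given <h3>."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the heading by its exact text.
    heading = soup.find("h3", string=heading_text)
    if heading is None:
        return []
    # Move to the next <ul> sibling, skipping whitespace between the tags.
    item_list = heading.find_next_sibling("ul")
    if item_list is None:
        return []
    # href=True keeps only <a> tags that actually carry an href attribute.
    return [a["href"] for a in item_list.find_all("a", href=True)]

hrefs = links_after_heading(SAMPLE_HTML, "Generic Line List")
print(hrefs)
```

Returning an empty list when the heading or list is missing avoids the AttributeError you would otherwise get from calling find_next_sibling() on None if the page layout changes.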

Demo:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> 
>>> url = 'http://playrustwiki.com/wiki/List_of_Items'
>>> soup = BeautifulSoup(requests.get(url).content)
>>>
>>> h3 = soup.find('h3', text='Generic Line List')
>>> generic_line_list = h3.find_next_sibling('ul')
>>> for link in generic_line_list.find_all('a', href=True):
...     print(link['href'])
... 
/wiki/Wood_Barricade
/wiki/Wood_Shelter
...
/wiki/Uber_Hunting_Bow
/wiki/Cooked_Chicken_Breast
/wiki/Anti-Radiation_Pills
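
The hrefs printed above are relative to the site root. If you need full URLs, one option is urljoin from the standard library. This is a small sketch, assuming the page URL from the question is the base; the two sample paths are copied from the output above.

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
BASE_URL = 'http://playrustwiki.com/wiki/List_of_Items'

# Two of the relative hrefs from the demo output above.
relative_hrefs = ['/wiki/Wood_Barricade', '/wiki/Anti-Radiation_Pills']

# urljoin resolves each relative path against the base URL.
absolute_urls = [urljoin(BASE_URL, href) for href in relative_hrefs]
for absolute_url in absolute_urls:
    print(absolute_url)
```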

beautifulsoup

html-parsing

web-scraping

python-3.x