HTML confusion with Python BeautifulSoup
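I'm following thenewboston's tutorial on YouTube, and I get no errors when I run my code.
I'm trying to print the "Generic Line List" and all the links that follow it in that list; it can be found at the bottom of this page: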
http://playrustwiki.com/wiki/List_of_Items
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:  # makes our pages change everytime
        url = 'http://playrustwiki.com/wiki/List_of_Items' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)  # find all the links in soup or all the titles
        for link in soup.findAll('a', {'class': 'a href'}):  # links are a for anchors in HTML
            href = link.get('href')  # href attribute
            print(href)
        page += 1

trade_spider(1)
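I've tried different HTML attributes, but I think that's where I started getting confused. Either I can't find the right attribute for my scraper to target, or I'm targeting the wrong one.
Please help~
Thanks :)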
The idea here is to find the h3 element with the Generic Line List text. Then find the next ul sibling via find_next_sibling() and get all the links inside it via find_all():
h3 = soup.find('h3', text='Generic Line List')
generic_line_list = h3.find_next_sibling('ul')
for link in generic_line_list.find_all('a', href=True):
    print(link['href'])
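If you want to try the same three steps without hitting the network, here is a minimal sketch that runs against a small inline HTML string. The markup in SAMPLE_HTML and the helper name links_after_heading are made up for illustration and are not taken from the wiki page; the sketch also assumes a bs4 version recent enough to accept string= (the newer name for the text= argument) and names the html.parser explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the shape described above:
# an <h3> heading followed by a sibling <ul> of links.
SAMPLE_HTML = """
<div>
  <h3>Other List</h3>
  <ul><li><a href="/wiki/Ignored">Ignored</a></li></ul>
  <h3>Generic Line List</h3>
  <ul>
    <li><a href="/wiki/Wood_Barricade">Wood Barricade</a></li>
    <li><a href="/wiki/Wood_Shelter">Wood Shelter</a></li>
    <li><a name="no-href">anchor without href</a></li>
  </ul>
</div>
"""


def links_after_heading(html, heading_text):
    """Return the hrefs inside the first <ul> sibling after the matching <h3>."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: find the heading whose text is exactly heading_text.
    h3 = soup.find("h3", string=heading_text)
    if h3 is None:
        return []

    # Step 2: move to the next <ul> at the same nesting level.
    ul = h3.find_next_sibling("ul")
    if ul is None:
        return []

    # Step 3: collect every <a> that actually has an href attribute.
    return [a["href"] for a in ul.find_all("a", href=True)]


links = links_after_heading(SAMPLE_HTML, "Generic Line List")
print(links)
```

The None checks are there because find() and find_next_sibling() return None when nothing matches, which would otherwise surface as an AttributeError on the next line.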
Demo:
>>> import requests
>>> from bs4 import BeautifulSoup
>>>
>>> url = 'http://playrustwiki.com/wiki/List_of_Items'
>>> soup = BeautifulSoup(requests.get(url).content)
>>>
>>> h3 = soup.find('h3', text='Generic Line List')
>>> generic_line_list = h3.find_next_sibling('ul')
>>> for link in generic_line_list.find_all('a', href=True):
...     print(link['href'])
...
/wiki/Wood_Barricade
/wiki/Wood_Shelter
...
/wiki/Uber_Hunting_Bow
/wiki/Cooked_Chicken_Breast
/wiki/Anti-Radiation_Pills
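As for why the code in the question printed nothing: as far as I can tell, {'class': 'a href'} asks for a tags whose class attribute is "a href", which is not the same as "a tags that have an href". The sketch below shows the difference on a made-up snippet (SAMPLE_HTML is hypothetical, not copied from the wiki). It also prints what the URL in the question becomes once str(page) is appended.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Wood_Barricade" title="Wood Barricade">Wood Barricade</a></li>
  <li><a href="/wiki/Wood_Shelter" class="internal">Wood Shelter</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Filter from the question: matches on the CSS class, not on the href attribute.
by_class = soup.find_all("a", {"class": "a href"})

# Filter from the answer: any <a> that carries an href attribute.
by_href = soup.find_all("a", href=True)
hrefs = [a["href"] for a in by_href]

print(len(by_class))
print(hrefs)

# The question also appends the page number to the URL, which changes the path.
base = "http://playrustwiki.com/wiki/List_of_Items"
requested_url = base + str(1)
print(requested_url)
```

So the loop in the question requests .../List_of_Items1 rather than the page you linked, and even on the right page the class filter would likely come back empty.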