How do I filter a list through a formatted web scraping for loop
I have a list of basketball players that I want to run through a web-scraping loop I have already set up. The list of players is the 2011 NBA draft order. I want to iterate over each player and pull their college stats from their final college season. The problem is that some of the drafted players never went to college, so their names don't map to a URL, and every time the loop reaches a player who didn't go to college the whole script errors out. I've tried including "pass" and "continue", but nothing seems to work. This is the closest I've gotten so far:
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User Agent': 'Mozilla/5.0'}

players = [
    'kyrie-irving','derrick-williams','enes-kanter',
    'tristan-thompson','jonas-valanciunas','jan-vesely',
    'bismack-biyombo','brandon-knight','kemba-walker',
    'jimmer-fredette','klay-thompson'
]
#the full list of players goes on for a total of 60 players, this is just the first handful

player_stats = []

for player in players:
    url = f'https://www.sports-reference.com/cbb/players/{player}-1.html'
    res = requests.get(url)
    #if player in url:
    #    continue
    #else:
    #    print("This player has no college stats")
    #Including this if/else statement makes the error say header is not defined. When it is not included, the error says NoneType object is not iterable
    soup = BeautifulSoup(res.content, 'lxml')
    header = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
    rows = soup.findAll('tr')
    player_stats.append([td.getText() for td in soup.find('tr', id='players_per_game.2011')])

graph = pd.DataFrame(player_stats, columns=header)
You can do one of two things:

1. Check the response status code. 200 means success; anything else is an error. The catch is that some sites serve a valid HTML page that says "invalid page", so you can still get a successful 200 response even when the player doesn't exist.

2. Just use try/except. If it fails, continue to the next item in the list.

Given the caveat with option 1, go with option 2 here. Also, have you considered using pandas to parse the table? It's a little easier to do (and it uses BeautifulSoup under the hood).

Finally, you'll need a bit more logic here. There are multiple college players named "Derrick Williams". I suspect https://www.sports-reference.com/cbb/players/derrick-williams-1.html is not the one you mean, so you'll need to work out how to handle that.
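For reference, option 1 can be sketched like this. `fetch_player_page` is a hypothetical helper, not part of the original code, and whether sports-reference actually returns a 404 for unknown player slugs is an assumption you should verify against the live site:

```python
import requests

def fetch_player_page(url, timeout=10):
    """Return the raw page bytes, or None when the request fails or the
    server reports a non-200 status (e.g. a 404 for a missing player)."""
    try:
        res = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None  # network error: treat the same as a missing page
    if res.status_code != 200:
        return None  # 404 etc.: no usable stats page for this player
    return res.content
```

Keep the caveat above in mind: a site that serves a styled "not found" page with status 200 slips past this check, which is why try/except is the more robust route here.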
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}

players = [
    'kyrie-irving','derrick-williams','enes-kanter',
    'tristan-thompson','jonas-valanciunas','jan-vesely',
    'bismack-biyombo','brandon-knight','kemba-walker',
    'jimmer-fredette','klay-thompson'
]
#the full list of players goes on for a total of 60 players, this is just the first handful

player_stats = []

for player in players:
    url = f'https://www.sports-reference.com/cbb/players/{player}-1.html'
    res = requests.get(url, headers=headers)
    try:
        soup = BeautifulSoup(res.content, 'lxml')
        header = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
        player_stats.append([td.getText() for td in soup.find('tr', id='players_per_game.2011')])
    except Exception:
        print(f"{player} has no college stats")

graph = pd.DataFrame(player_stats, columns=header)
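On the duplicate-name point above: sports-reference disambiguates same-named players with a numeric suffix, so one hedged approach is to generate the candidate URLs and inspect each page (e.g. by draft year) to pick the right one. `candidate_urls` is a hypothetical helper, and the suffix scheme is an assumption based on the URL in the question:

```python
def candidate_urls(player, max_suffix=3):
    """Build the URLs to try in order for one player slug; sports-reference
    appears to number same-named players -1, -2, and so on."""
    base = 'https://www.sports-reference.com/cbb/players'
    return [f'{base}/{player}-{i}.html' for i in range(1, max_suffix + 1)]
```

You would still fetch each candidate and check something identifying on the page before trusting it, since `-1` may simply be a different player with the same name.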
With pandas:

graph = pd.DataFrame()
for player in players:
    try:
        url = f'https://www.sports-reference.com/cbb/players/{player}-1.html'
        df = pd.read_html(url)[0]  # first table on the page
        cols = list(df.columns)
        df = df.iloc[-2][cols]     # second-to-last row: the final college season
        df['Player'] = player
        # DataFrame.append was removed in pandas 2.0; concat the row instead
        graph = pd.concat([graph, df.to_frame().T], ignore_index=True)
        graph = graph[['Player'] + cols]
    except Exception:
        print(f"{player} has no college stats")
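If the draft list starts out as display names rather than ready-made URL slugs, a small helper can build them. This is a hypothetical convenience, not part of the original code, and real sports-reference slugs occasionally differ (accented names, Jr./III suffixes), so treat it as a first guess:

```python
def player_slug(name):
    """Lowercase, strip periods and apostrophes, and join words with
    hyphens: 'Kyrie Irving' -> 'kyrie-irving'."""
    cleaned = name.lower().replace('.', '').replace("'", '')
    return '-'.join(cleaned.split())
```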