Can Only Scrape Portion of Table with Python and BS4

Question

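I am trying to scrape the draft order table from the Wikipedia page for the 2012 NFL Draft.

I have run into a problem: the only data extracted comes from the rows with a different background color (the rows with a "*" next to the integer).

My code is as follows: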

import requests
from bs4 import BeautifulSoup

wikiURL = "https://en.wikipedia.org/wiki/2012_NFL_Draft"

#create array to store player info in
teams_players = []

# request and parse wikiURL
r = requests.get(wikiURL)
soup = BeautifulSoup(r.content, "html.parser")

#find table in wikipedia
playerData = soup.find('table', {"class": "wikitable sortable"})

for row in playerData.find_all('tr'):
    cols = row.find_all('td')

    if len(cols) == 9: 

        teams_players.append((cols[3].text.strip(), cols[4].text.strip()))

for team, player in teams_players:
    print('{:35} {}'.format(team, player))

Answer 1

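That is because of the if len(cols) == 9: condition: cols holds only td cells, and only the highlighted rows have exactly nine of them. You need to:

Skip the first (header) row.

Look for both td and th elements inside every tr.

Skip rows with fewer than 6 cells.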
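
To see why counting only td cells drops rows, here is a minimal sketch on a made-up three-row table. The HTML below is an assumption for illustration and is not the actual Wikipedia markup; it only shows how a row that uses th for some cells yields a smaller td count.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from Wikipedia: the first data row uses <th>
# for its leading cell, the second data row uses <td> throughout.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Rnd.</th><th>Pick</th><th>Player</th></tr>
  <tr><th>1</th><td>1</td><td>Player A</td></tr>
  <tr><td>1</td><td>2</td><td>Player B</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find("table").find_all("tr")

# Counting only <td> gives a different number for each row
td_only_counts = [len(row.find_all("td")) for row in rows]

# Passing a list of tag names matches either tag, so every row has 3 cells
td_and_th_counts = [len(row.find_all(["td", "th"])) for row in rows]

print(td_only_counts)    # [0, 2, 3]
print(td_and_th_counts)  # [3, 3, 3]
```

A fixed-length check against the td count therefore matches only the rows that happen to use td for every cell.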

Fixed version:

# [1:] skips the header row
for row in playerData.find_all('tr')[1:]:
    # collect both td and th cells in each row
    cols = row.find_all(['td', 'th'])
    # skip short rows
    if len(cols) < 6:
        continue
    teams_players.append((cols[5].text.strip(), cols[6].text.strip()))

Prints:

QB                                  Stanford
QB                                  Baylor
...
RB                                  Abilene Christian
QB                                  NIU
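
The fixed loop could also be wrapped in a small helper, sketched below under a few assumptions. The function name extract_pairs, its parameters, and the sample HTML are hypothetical and not part of the original answer. The sketch returns an empty list when no matching table is found (find() returns None in that case) and ties the length check to the highest index used, so a row with exactly six cells cannot raise an IndexError.

```python
from bs4 import BeautifulSoup


def extract_pairs(html, first_col=5, second_col=6):
    """Return (first_col, second_col) text pairs from the first matching table.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable sortable"})
    if table is None:
        # find() returns None when nothing matches
        return []

    pairs = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all(["td", "th"])  # match either tag
        if len(cells) <= max(first_col, second_col):
            continue  # row is too short to index safely
        pairs.append((cells[first_col].get_text(strip=True),
                      cells[second_col].get_text(strip=True)))
    return pairs


# Made-up table for demonstration; not the real Wikipedia markup.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Rnd.</th><th>Pick</th><th>Team</th><th>Player</th><th>Notes</th><th>Pos.</th><th>College</th></tr>
  <tr><th>1</th><td>1</td><td>Team A</td><td>Player A</td><td></td><td>QB</td><td>College A</td></tr>
  <tr><td>1</td><td>2</td><td>Team B</td><td>Player B</td><td></td><td>RB</td><td>College B</td></tr>
  <tr><td colspan="7">Footnote row</td></tr>
</table>
"""

for position, college in extract_pairs(SAMPLE_HTML):
    print('{:35} {}'.format(position, college))
```

For the live page, the same helper would presumably be called as extract_pairs(requests.get(wikiURL).content), since BeautifulSoup accepts the raw response bytes.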

Tags: python, wikipedia, beautifulsoup, python-requests, bs4