How to parse only specific tag values

Question

I'm new to Python, and I've written some code to parse data from the ESPN site:

#importing packages/modules
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request  # urlopen lives in the urllib.request submodule, which must be imported explicitly

#additional data for scraping
url = 'http://espn.go.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/'

#scraping the site
page = urllib.request.urlopen(url + str(2)).read()
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

#the header of the table
table_header = [ td.get_text() for td in soup.find_all('td')[:20] ]
table_header[-1] = 'SH A'
table_header[-2] = 'SH G'
table_header[-3] = 'PP A'
table_header[-4] = 'PP G'
table_header.remove('')
table_header.remove('PP')
table_header.remove('SH')

#the data for table


#print(player_names = [ a.get_text() for a in soup.find_all('tr') ])
player_name = [ a.get_text() for a in soup.find_all('tr')[2:12] ]

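The question is: how do I get only the data located between the <a> tags? When I print() the list player_name, each entry is the text of the entire row run together: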

['1Jamie Benn, LWDAL823552871641.0625313.86101323', '2John Tavares, C NYI823848865461.0527813.78131801', '3Sidney Crosby, C PIT772856845471.0923711.83102100', '4Alex Ovechkin, LWWSH8153288110581.0039513.41125900', '\xa0Jakub Voracek, RWPHI822259811780.9922110.03112200', '6Nicklas Backstrom, C WSH821860785400.9515311.8333000', '7Tyler Seguin, C DAL71374077-1201.0828013.25131600', '8Jiri Hudler, LWCGY7831457617140.9715819.6561000', '\xa0Daniel Sedin, LWVAN822056765180.932268.9542100', '10Vladimir Tarasenko, RWSTL7737367327310.9526414.0681000']

Thank you very much for your help!

Answer 1

I would use a CSS selector to target only the data rows (the ones that contain player data):

for tr in soup.select("#my-players-table tr[class*=player]"):
    player_name = tr('td')[1].get_text(strip=True)
    print(player_name)

class*=player means "the class attribute contains the substring player", so header rows without such a class are skipped.

This prints:

Jamie Benn, LW
John Tavares, C
Sidney Crosby, C
Alex Ovechkin, LW
...
Jordan Eberle, RW
Ondrej Palat, LW
Zach Parise, LW
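
If only the text inside the <a> tag is wanted (the name without the position), one option is to look up the link within each matched row. The following is a minimal sketch, not the definitive approach: SAMPLE_HTML is hypothetical, simplified markup modelled on the table described above (the real ESPN page may be structured differently), and extract_player_names is a helper name introduced here purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page (assumption).
SAMPLE_HTML = """
<table id="my-players-table">
  <tr class="colhead"><td>RK</td><td>PLAYER</td><td>TEAM</td></tr>
  <tr class="oddrow player-3-1"><td>1</td><td><a href="#">Jamie Benn</a>, LW</td><td>DAL</td></tr>
  <tr class="evenrow player-3-2"><td>2</td><td><a href="#">John Tavares</a>, C</td><td>NYI</td></tr>
</table>
"""


def extract_player_names(html):
    """Return the text of the first <a> in every row whose class contains 'player'."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tr in soup.select("#my-players-table tr[class*=player]"):
        link = tr.find("a")  # first <a> descendant of the row, or None
        if link is not None:
            names.append(link.get_text(strip=True))
    return names


names = extract_player_names(SAMPLE_HTML)
print(names)
```

With the sample markup above, the header row is skipped because its class does not contain "player", and the position suffix (", LW") is dropped because it sits outside the <a> element.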

python

html-parsing

bs4