How to parse only specific tag values
I'm new to Python, and I've written some code to parse data from the ESPN site:
#importing packages/modules
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request
#additional data for scraping
url = 'http://espn.go.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/'
#scraping the site
page = urllib.request.urlopen(url + str(2)).read()
soup = BeautifulSoup(page, 'html.parser')
#the header of the table
table_header = [ td.get_text() for td in soup.find_all('td')[:20] ]
table_header[-1] = 'SH A'
table_header[-2] = 'SH G'
table_header[-3] = 'PP A'
table_header[-4] = 'PP G'
table_header.remove('')
table_header.remove('PP')
table_header.remove('SH')
#the data for table
#print(player_names = [ a.get_text() for a in soup.find_all('tr') ])
player_name = [ a.get_text() for a in soup.find_all('tr')[2:12] ]
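The question is: how do I get only the data located between the <a> tags? When I print() the list player_name, each entry is the text of the whole row run together: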
['1Jamie Benn, LWDAL823552871641.0625313.86101323', '2John Tavares, C NYI823848865461.0527813.78131801', '3Sidney Crosby, C PIT772856845471.0923711.83102100', '4Alex Ovechkin, LWWSH8153288110581.0039513.41125900', '\xa0Jakub Voracek, RWPHI822259811780.9922110.03112200', '6Nicklas Backstrom, C WSH821860785400.9515311.8333000', '7Tyler Seguin, C DAL71374077-1201.0828013.25131600', '8Jiri Hudler, LWCGY7831457617140.9715819.6561000', '\xa0Daniel Sedin, LWVAN822056765180.932268.9542100', '10Vladimir Tarasenko, RWSTL7737367327310.9526414.0681000']
Thank you very much for your help!
I would use a CSS selector to target only the data rows (the ones that contain player data):
for tr in soup.select("#my-players-table tr[class*=player]"):
    player_name = tr('td')[1].get_text(strip=True)
    print(player_name)
Here class*=player means "the class attribute contains player", and tr('td') is shorthand for tr.find_all('td').
This prints:
Jamie Benn, LW
John Tavares, C
Sidney Crosby, C
Alex Ovechkin, LW
...
Jordan Eberle, RW
Ondrej Palat, LW
Zach Parise, LW
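To make the difference between the cell text and the <a> text concrete, here is a minimal, self-contained sketch. The HTML is a hypothetical, simplified stand-in for the ESPN table (the table id and the "player-..." class names are assumptions based on the selector above, not the real page markup), so treat it as an illustration rather than the exact structure of the site:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the ESPN stats table.
# The id and class names are assumptions for illustration only.
html = """
<table id="my-players-table">
  <tr class="stathead"><td colspan="3">Points Leaders</td></tr>
  <tr class="colhead"><td>RK</td><td>PLAYER</td><td>TEAM</td></tr>
  <tr class="oddrow player-3-1"><td>1</td><td><a href="#">Jamie Benn</a>, LW</td><td>DAL</td></tr>
  <tr class="evenrow player-3-2"><td>2</td><td><a href="#">John Tavares</a>, C</td><td>NYI</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only rows whose class attribute contains "player";
# this skips the title and header rows.
rows = soup.select('#my-players-table tr[class*="player"]')

# Text of the whole second cell: name plus position.
cell_texts = [tr.find_all("td")[1].get_text(strip=True) for tr in rows]

# Text between the <a> tags only: just the name.
link_texts = [tr.find("a").get_text(strip=True) for tr in rows]

print(cell_texts)
print(link_texts)
```

If only the name is needed, taking the text of the <a> element is probably the simplest option; if the position is needed as well, the text of the whole cell keeps it.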