Why can't I parse this HTML page in Python?
I am trying to parse information from the web page http://www.baseball-reference.com/teams/BOS/2000-pitching.shtml in Python using BeautifulSoup. I want to print the corresponding name of each player in the "Team Pitching" table. However, after a certain point the code starts repeating the players' names (in this case, from line 15 onward it repeats the names, starting again with 'Pedro Martinez'). For example:
1 Pedro Martinez
2 Jeff Fassero*
3 Ramon Martinez
4 Pete Schourek*
5 Rolando Arrojo
6 Tomo Ohka
7 Derek Lowe
8 Tim Wakefield
9 Rich Garces
10 Rheal Cormier*
11 Hipolito Pichardo
12 Brian Rose
13 Bryce Florie
14 John Wasdin
15 Pedro Martinez
16 Jeff Fassero*
17 Ramon Martinez
18 Pete Schourek*
19 Rolando Arrojo
20 Tomo Ohka
21 Derek Lowe
22 Tim Wakefield
23 Rich Garces
24 Rheal Cormier*
25 Hipolito Pichardo
26 Brian Rose
27 Bryce Florie
28 John Wasdin
Do you know what is going on? Here is my code:
# Sample web page
# http://www.baseball-reference.com/teams/BOS/2000-pitching.shtml
import urllib2
import lxml
from bs4 import BeautifulSoup
# Download the webpage
y = 2000
url = 'http://www.baseball-reference.com/teams/BOS/' + str(y) + '-pitching.shtml'
print 'Download from :', url
# download
filehandle = urllib2.urlopen(url)
fileout = 'YEARS' + str(y) + '.html'
print 'Save to : ', fileout, '\n'
# save file to disk
f = open(fileout, 'w')
f.write(filehandle.read())
f.close()
# Read and parse the html file
# Parse information about the age of players in 2000
y = 2000
filein = 'YEARS' + str(y) + '.html'
print(filein)
soup = BeautifulSoup(open(filein))
entries = soup.find_all('tr', attrs={'class': ''})  # ' non_qual' ''
print(len(entries))
i = 0
for entry in entries:
    columns = entry.find_all('td')
    #print(len(columns), 'i:', i)
    if len(columns) == 34:  # Number of columns of the table
        i = i + 1
        #print i, len(columns)
        age = columns[2].get_text()
        print i, age
You are trying to loop over all the rows in the table, but you never actually grab the table tags first. So grab all the table tags, and then loop over the tr tags inside each table. Also, year and table were undefined, so I assumed the year is y and named the table variable t. As a side note, you don't have to download the HTML to a file and then open it to parse it; you can fetch the page and parse the response text directly.
import urllib2
from bs4 import BeautifulSoup
# Download the webpage
y = 2010
url = 'http://www.baseball-reference.com/teams/BOS/' + str(y) + '-pitching.shtml'
print 'Download from :', url
# download
filehandle = urllib2.urlopen(url)
fileout = 'YEARS' + str(y) + '.html'
print 'Save to : ', fileout, '\n'
# save file to disk
f = open(fileout, 'w')
f.write(filehandle.read())
f.close()
# Read and parse the html file
# Parse information about the players in 2010
y = 2010
filein = 'YEARS' + str(y) + '.html'
print(filein)
soup = BeautifulSoup(open(filein))
# Grab the target table first, then iterate over its rows
table = soup.find_all('table', attrs={'id': 'team_pitching'})
for t in table:
    i = 1
    entries = t.find_all('tr', attrs={'class': ''})  # ' non_qual' ''
    print(len(entries))
    for entry in entries:
        columns = entry.find_all('td')
        printString = str(i) + ' '
        for col in columns:
            try:
                if ((',' in col['csk']) and (col['csk'] != '')):
                    printString = printString + col.text
                    i = i + 1
                    print printString
            except:
                pass
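The fix hinges on scoping `find_all` to a single table rather than searching the whole document. A minimal self-contained sketch (Python 3, using bs4's built-in `html.parser` backend; the two-table HTML snippet is invented for illustration) shows why an unscoped search duplicates rows:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables, mimicking the real page's layout:
# searching the whole soup for <tr> picks up rows from BOTH tables,
# which is what duplicates the player names.
html = """
<table id="team_pitching">
  <tr><td>Pedro Martinez</td></tr>
  <tr><td>Jeff Fassero*</td></tr>
</table>
<table id="other_table">
  <tr><td>Pedro Martinez</td></tr>
  <tr><td>Jeff Fassero*</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Unscoped: collects rows from every table on the page (4 here, not 2).
all_rows = soup.find_all("tr")
print(len(all_rows))

# Scoped: grab the target table first, then search only within it.
table = soup.find("table", attrs={"id": "team_pitching"})
rows = table.find_all("tr")
print([r.get_text(strip=True) for r in rows])
```

The same scoping applies to the side note about skipping the intermediate file: pass the fetched response text straight to the `BeautifulSoup` constructor instead of writing it to disk first.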