The second and third rows should be a single row
from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring
import re
import csv
import pandas as pd
wiki = "http://en.wikipedia.org/wiki/List_of_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
try:
    table = soup.find_all('table')[1]
except AttributeError as e:
    print 'No tables found, exiting'
try:
    rows = table.find_all('tr')
except AttributeError as e:
    print 'No table rows found, exiting'
try:
    first = table.find_all('tr')[0]
except AttributeError as e:
    print 'No table row found, exiting'
try:
    allRows = table.find_all('tr')[1:]
except AttributeError as e:
    print 'No table row found, exiting'
results = []
firstRow = first.find_all('td')
results.append([header.get_text() for header in firstRow])
for row in allRows:
    table_headers = row.find_all('th')
    table_data = row.find_all('td')
    if table_headers:
        results.append([headers.get_text() for headers in table_headers])
    if table_data:
        results.append([data.get_text() for data in table_data])
df = pd.DataFrame(data = results)
df
Expected output:
Margin Teams Venue Season
Innings and 579 runs | England (903-7 d) beat Australia (201 & 123) | The Oval, London | 1938
Innings and 360 runs | Australia (652–7 d) beat South Africa (159 & ..| New Wanderers Stadium, Johannesburg | 2001–02
Innings and 336 runs | West Indies (614–5 d) beat India (124 & 154) | Eden Gardens, Kolkata | 1958–59
Innings and 332 runs | Australia (645) beat England (141 & 172) | Brisbane Cricket Ground | 1946–47
Innings and 324 runs | Pakistan (643) beat New Zealand (73 & 246) | Gaddafi Stadium, Lahore | 2002
You need to collect the th and td tags together; each data row in this table contains both th and td cells, so appending them as two separate lists is what splits every row of output in two:
for row in allRows:
    results.append([data.get_text() for data in row.find_all(['th', 'td'])])
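find_all() accepts a list of tag names and returns the matching cells in document order, so each table row now yields exactly one list.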
Also, don't forget to omit the last row, which only contains the Last updated: ... text (the [1:-1] slice below drops both the header row and that footer row):
allRows = table.find_all('tr')[1:-1]
In addition, if you want the column names in the DataFrame to match the table headers on the page, you need to pass the columns keyword argument when creating the DataFrame:
headers = [header.get_text() for header in first.find_all('td')]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]
df = pd.DataFrame(data=results, columns=headers)
print(df)
which produces:
Margin Teams \
0 Innings and 579 runs England (903-7 d) beat Australia (201 & 123)
1 Innings and 360 runs Australia (652–7 d) beat South Africa (159 & ...
2 Innings and 336 runs West Indies (614–5 d) beat India (124 & 154)
3 Innings and 332 runs Australia (645) beat England (141 & 172)
4 Innings and 324 runs Pakistan (643) beat New Zealand (73 & 246)
Venue Season
0 The Oval, London 1938
1 New Wanderers Stadium, Johannesburg 2001–02
2 Eden Gardens, Kolkata 1958–59
3 Brisbane Cricket Ground 1946–47
4 Gaddafi Stadium, Lahore 2002
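For completeness, here is a minimal Python 3 sketch of the same approach (urllib2 became urllib.request in Python 3). It assumes the record table is still the second table on the page and still ends with a Last updated: footer row; the html.parser backend and strip=True are my own choices, not part of the answer above.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_Test_cricket_records"
req = Request(wiki, headers={'User-Agent': 'Mozilla/5.0'})  # still needed to avoid the 403
soup = BeautifulSoup(urlopen(req), 'html.parser')

table = soup.find_all('table')[1]   # assumption: the target table is still at index 1
rows = table.find_all('tr')

# First row holds the column names; last row is the "Last updated: ..." footer.
headers = [cell.get_text(strip=True) for cell in rows[0].find_all(['th', 'td'])]
results = [[cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
           for row in rows[1:-1]]

df = pd.DataFrame(results, columns=headers)
print(df)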