The second and third rows should be a single row

Question

from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring
import re
import csv
import pandas as pd

wiki = "http://en.wikipedia.org/wiki/List_of_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

try:
    table = soup.find_all('table')[1]
except AttributeError as e:
    print 'No tables found, exiting'


try:
    rows = table.find_all('tr')
except AttributeError as e:
    print 'No table rows found, exiting'


try:
    first = table.find_all('tr')[0]
except AttributeError as e:
    print 'No table row found, exiting'

try:
    allRows = table.find_all('tr')[1:]
except AttributeError as e:
    print 'No table row found, exiting'

results = [] 

firstRow = first.find_all('td')
results.append([header.get_text() for header in firstRow])

for row in allRows:
    table_headers = row.find_all('th')
    table_data = row.find_all('td')
    if table_headers : 
        results.append([headers.get_text() for headers in table_headers])    
    if table_data :
        results.append([data.get_text() for data in table_data])

df = pd.DataFrame(data = results)
df

Expected output:

Margin | Teams | Venue | Season

Innings and 579 runs | England (903-7 d) beat Australia (201 & 123) | The Oval, London  | 1938

Innings and 360 runs | Australia (652–7 d) beat South Africa (159 & ..| New Wanderers Stadium, Johannesburg | 2001–02

Innings and 336 runs | West Indies (614–5 d) beat India (124 & 154) |  Eden Gardens, Kolkata | 1958–59

Innings and 332 runs | Australia (645) beat England (141 & 172) | Brisbane Cricket Ground | 1946–47

Innings and 324 runs | Pakistan (643) beat New Zealand (73 & 246) | Gaddafi Stadium, Lahore | 2002

Answer 1

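You need to collect the th and td tags together, in a single find_all call, so that all cells of a row end up in the same list: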

for row in allRows:
    results.append([data.get_text() for data in row.find_all(['th', 'td'])])
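
To see why the original loop splits each record in two, here is a minimal sketch on a made-up two-row table. It assumes, as the output in the question suggests, that the first cell of each data row is a th and the rest are td; the HTML string and the names split_rows and single_row are illustrative only, not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table: the first cell of the data row is a <th>,
# the remaining cell is a <td>.
html = """
<table>
  <tr><td>Margin</td><td>Teams</td></tr>
  <tr><th>Innings and 579 runs</th><td>England beat Australia</td></tr>
</table>
"""

row = BeautifulSoup(html, "html.parser").find_all("tr")[1]

# Separate lookups, appended separately (as in the question), give two partial rows.
split_rows = []
for tag in ("th", "td"):
    cells = row.find_all(tag)
    if cells:
        split_rows.append([cell.get_text() for cell in cells])

# One lookup for both tag names keeps the cells together, in document order.
single_row = [cell.get_text() for cell in row.find_all(["th", "td"])]

print(split_rows)  # [['Innings and 579 runs'], ['England beat Australia']]
print(single_row)  # ['Innings and 579 runs', 'England beat Australia']
```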

Also, don't forget to omit the last row, which contains only the "Last updated: ..." text:

allRows = table.find_all('tr')[1:-1]
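
If you would rather not rely on the footer always being the last row, one possible variant is to keep only the rows that have as many cells as the header. This is a sketch on already-extracted lists, assuming the footer row has fewer cells than the header; the sample data and the names by_slice and by_width are made up for illustration.

```python
# Hypothetical parsed rows: header, two data rows, and a one-cell footer.
parsed = [
    ["Margin", "Teams", "Venue", "Season"],
    ["Innings and 579 runs", "England beat Australia", "The Oval, London", "1938"],
    ["Innings and 324 runs", "Pakistan beat New Zealand", "Gaddafi Stadium, Lahore", "2002"],
    ["Last updated: ..."],
]

header, body = parsed[0], parsed[1:]

# Positional slice, equivalent to [1:-1] above: drop the final row unconditionally.
by_slice = body[:-1]

# Defensive variant: keep only rows whose cell count matches the header.
by_width = [row for row in body if len(row) == len(header)]

print(by_slice == by_width)  # True
```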

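In addition, if you want the column names in the DataFrame to match the table headers on the page, you need to pass the columns keyword argument when creating the DataFrame: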

headers = [header.get_text() for header in first.find_all('td')]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]

df = pd.DataFrame(data=results, columns=headers)
print(df)

This produces:

                 Margin                                              Teams  \
0  Innings and 579 runs       England (903-7 d) beat Australia (201 & 123)   
1  Innings and 360 runs   Australia (652–7 d) beat South Africa (159 & ...   
2  Innings and 336 runs       West Indies (614–5 d) beat India (124 & 154)   
3  Innings and 332 runs           Australia (645) beat England (141 & 172)   
4  Innings and 324 runs         Pakistan (643) beat New Zealand (73 & 246)   

                                 Venue   Season  
0                     The Oval, London     1938  
1  New Wanderers Stadium, Johannesburg  2001–02  
2                Eden Gardens, Kolkata  1958–59  
3              Brisbane Cricket Ground  1946–47  
4              Gaddafi Stadium, Lahore     2002
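
The question's code is Python 2 (urllib2, print statements). For reference, here is a rough Python 3 sketch of the same approach. The function names fetch_html and table_to_frame are my own, and it assumes the first row of the table holds the column names and the last row is the footer, as on the page at the time of the question.

```python
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page, sending a User-Agent header as the question does."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read()


def table_to_frame(html, table_index=1):
    """Turn the table at table_index into a DataFrame.

    Assumes the first row holds the column names and the last row is a footer.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("table")[table_index].find_all("tr")
    # Collect th and td cells together so each table row stays one record.
    cells = [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in rows
    ]
    return pd.DataFrame(cells[1:-1], columns=cells[0])
```

It could then be called as table_to_frame(fetch_html(wiki)), with wiki being the URL from the question. The page layout may have changed since, so table index 1 is not guaranteed to still be the same table.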


Tags: html, python, beautifulsoup, html-parsing, pandas