Scraping with Beautiful Soup

I stumbled upon this excellent post about scraping with Beautiful Soup, and decided to take on the task of scraping some data from the internet.

I'm working with flight data from Flight Radar 24, and using what's described in the blog post to try to automatically scrape the flight data pages.

import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    response = requests.get(index_url)
    # naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # the attribute value in the selector needs quoting for newer bs4 releases
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

try:
    table = soup.find('table')
    rows = table.find_all('tr')
    for row in rows:
        flight_data = {}
        flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
        flight_data['tr'] = row #error here
        print (flight_data)

except AttributeError as e:
    raise ValueError("No valid table found")

flight data page (sample)

I fumbled my way up to the table, then realized I don't know how to traverse down through the table to get the data embedded in each column.

Would any kind soul have any pointers, or even an introductory tutorial, so I can read up on how to extract the data?

P.S.: Thanks to Miguel Grinberg for the excellent tutorials.
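For reference, the row-and-cell walk being asked about can be sketched offline. The HTML below is invented for illustration and only mimics the shape of the Flightradar24 table (a thead of column headers, a tbody with one row per flight); the real page's markup may differ.

```python
import bs4

# Invented HTML mimicking the shape of the flight data table.
html = """
<table>
  <thead><tr><th>From</th><th>To</th><th>Status</th></tr></thead>
  <tbody>
    <tr><td>Singapore (SIN)</td><td>Penang (PEN)</td><td>Scheduled</td></tr>
    <tr><td>Singapore (SIN)</td><td>Ipoh (IPH)</td><td>Scheduled</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# Walk the body rows; each <td> holds one column's value for that flight.
parsed = []
for tr in table.select('tbody tr'):
    parsed.append([td.get_text(strip=True) for td in tr.select('td')])

print(parsed)
```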

Added:

try:
    table = soup.find('table')
    rows = table.find_all('tr')
    heads = [i.text.strip() for i in table.select('thead th')]
    for tr in table.select('tbody tr'):
        flight_data = {}
        flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
        flight_data['From'] = tr.select('td.From')
        flight_data['To'] = tr.select('td.To')

        print(flight_data)

except AttributeError as e:
    raise ValueError("No valid table found")

I changed the last part of the code to build the data object, but I can't seem to get the data out.
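A likely reason the dict's values look unusable is that `select()` returns a list of Tag objects, not strings; the cell text still has to be pulled out with `get_text()`. A minimal illustration with invented markup (the `From`/`To` cell classes are an assumption carried over from the attempt above):

```python
import bs4

# Invented markup; the real page's cell classes are assumed to be "From"/"To".
html = '<tr><td class="From">Singapore (SIN)</td><td class="To">Penang (PEN)</td></tr>'
tr = bs4.BeautifulSoup(html, 'html.parser').find('tr')

raw = tr.select('td.From')                        # a list of Tag objects, not strings
texts = [td.get_text(strip=True) for td in raw]   # the actual cell text

print(texts)
```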

Final edit:

import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

try:
    table = soup.find('table')
    # only the body rows carry the data-* attributes; the header row does not
    for tr in table.select('tbody tr'):
        flight_data = {}
        flight_data['flight_number'] = tr['data-flight-number']
        flight_data['from'] = tr['data-name-from']
        print(flight_data)

except AttributeError as e:
    raise ValueError("No valid table found")

P.P.S.: Thanks to @amow for the great help :D

Here `table` starts off as the table you have in your HTML.

heads = [i.text.strip() for i in table.select('thead th')]
for tr in table.select('tbody tr'):
    datas = [i.text.strip() for i in tr.select('td')]
    print(dict(zip(heads, datas)))

Output

{   
    u'STD': u'06:30',   
    u'Status': u'Scheduled',   
    u'ATD': u'-',  
    u'From': u'Singapore  (SIN)',  
    u'STA': u'07:55',  
    u'\xa0': u'',  # this is the last column and has no meaning
    u'To': u'Penang  (PEN)',  
    u'Aircraft': u'-',  
    u'Date': u'2015-04-19'
}
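Stripped of the scraping, the `dict(zip(...))` step in the answer is just pairing the header list with the cell list, which can be seen in isolation (the sample values below are taken from the output above):

```python
heads = ['From', 'To', 'Status']                          # from table.select('thead th')
datas = ['Singapore (SIN)', 'Penang (PEN)', 'Scheduled']  # from tr.select('td')

# zip pairs heads with cells positionally; dict turns the pairs into a record
row = dict(zip(heads, datas))
print(row)
```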

If you want to get the data stored on the tr tag itself, just use

tr['data-data']
tr['data-flight-number']

and so on.
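This dict-style access works on any Tag. A self-contained sketch with invented markup (the flight number and attribute names below are made up for illustration; only `data-flight-number` and `data-name-from` come from the thread above):

```python
import bs4

# Invented markup carrying the same kind of data-* attributes as the real rows.
html = '<tr data-flight-number="TR2452" data-name-from="Singapore"></tr>'
tr = bs4.BeautifulSoup(html, 'html.parser').find('tr')

number = tr['data-flight-number']        # dict-style access to a tag attribute
missing = tr.get('data-missing', 'n/a')  # .get() avoids a KeyError for absent attributes

print(number, missing)
```

Note that a missing attribute raises `KeyError` rather than `AttributeError`, so the `except AttributeError` in the final edit would not catch it; `Tag.get()` with a default is a safer lookup.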