是否可以将抓取应用到这个正在积极重新计算的页面?
Can scraping be applied to this page which is actively recalculating?
我想从下面的页面获取卫星位置,但我不确定抓取是否合适,因为该页面似乎每秒都在使用一些内部代码进行自我更新(它在我之后不断更新断开与互联网的连接)。可以在 Space Stackexchange 的问题中找到背景信息:A nicer way to download the positions of the Orbcomm-2 satellites.
我需要 "snapshot" 四件商品同时:
- UTC 时间
- 纬度
- 经度
- 海拔高度
现在我使用屏幕截图和手动输入。由于这些值正在由页面更新 - 传统的网络抓取是否可以在这里工作?我找到了一个 "screen-scraping" 标签,我应该尝试了解它吗?
我正在寻找获得这四个值的最简单的解决方案,我想知道我是否可以只使用 urllib
或 urllib2
而避免安装新的东西?
示例页面:http://www.satview.org/?sat_id=41186U我需要完成 41179U 到 41189U(SpaceX 刚刚送入轨道的 11 颗 Orbcomm-2 卫星)
一个选择是启动真正的浏览器并在无限循环中不断轮询位置:
import time
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.satview.org/?sat_id=41186U")
while True:
location = driver.find_element_by_css_selector("#sat_latlon .texto_track2").text
latitude, longitude = location.split("\n")[:2]
print(latitude, longitude)
time.sleep(1)
示例输出:
(u'-16.57', u'66.63')
(u'-16.61', u'66.67')
...
这里我们使用selenium
and Firefox - there are multiple drivers for different browsers including headless, like PhantomJS
.
不需要刮。只需查看该页面的来源 html 和 copy/paste javascript 代码。 None 的位置是远程获取的...它们都是在页面中动态计算的。因此,只需获取代码并 运行 自己!
Space-Track.org 的 REST API 似乎是为处理此类请求而构建的。在那里拥有帐户后,您甚至可以下载示例脚本(在此处更新)来下载 TLES:
# STTest.py
#
# Simple Python app to extract resident space object history data from www.space-track.org into a spreadsheet
# (prior to executing, register for a free personal account at https://www.space-track.org/auth/createAccount)
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free Software Foundation,
# either version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
# without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# See the GNU General Public License for more details.
#
# For full licencing terms, please refer to the GNU General Public License (gpl-3_0.txt) distributed with this release,
# or see http://www.gnu.org/licenses/gpl-3.0.html.
import requests
import json
import xlsxwriter
import time
import datetime
import getpass
import sys
class MyError(Exception):
def __init___(self, args):
Exception.__init__(self, "my exception was raised with arguments {0}".format(args))
self.args = args
# See https://www.space-track.org/documentation for details on REST queries
# "Find Starlinks" query finds all satellites w/ NORAD_CAT_ID > 40000 & OBJECT_NAME matching STARLINK*, 1 line per sat;
# the "OMM Starlink" query gets all Orbital Mean-Elements Messages (OMM) for a specific NORAD_CAT_ID in JSON format.
uriBase = "https://www.space-track.org"
requestLogin = "/ajaxauth/login"
requestCmdAction = "/basicspacedata/query"
requestFindStarlinks = "/class/tle_latest/NORAD_CAT_ID/>40000/ORDINAL/1/OBJECT_NAME/STARLINK~~/format/json/orderby/NORAD_CAT_ID%20asc"
requestOMMStarlink1 = "/class/omm/NORAD_CAT_ID/"
requestOMMStarlink2 = "/orderby/EPOCH%20asc/format/json"
# Parameters to derive apoapsis and periapsis from mean motion (see https://en.wikipedia.org/wiki/Mean_motion)
GM = 398600441800000.0
GM13 = GM ** (1.0 / 3.0)
MRAD = 6378.137
PI = 3.14159265358979
TPI86 = 2.0 * PI / 86400.0
# Log in to personal account obtained by registering for free at https://www.space-track.org/auth/createAccount
print('\nEnter your personal Space-Track.org username (usually your email address for registration): ')
configUsr = input()
print('Username capture complete.\n')
configPwd = getpass.getpass(prompt='Securely enter your Space-Track.org password (minimum of 15 characters): ')
# Excel Output file name - e.g. starlink-track.xlsx (note: make it an .xlsx file)
configOut = 'STText.xslx'
siteCred = {'identity': configUsr, 'password': configPwd}
# User xlsxwriter package to write the .xlsx file
print('Creating Microsoft Excel (.xlsx) file to contain outputs...')
workbook = xlsxwriter.Workbook(configOut)
worksheet = workbook.add_worksheet()
z0_format = workbook.add_format({'num_format': '#,##0'})
z1_format = workbook.add_format({'num_format': '#,##0.0'})
z2_format = workbook.add_format({'num_format': '#,##0.00'})
z3_format = workbook.add_format({'num_format': '#,##0.000'})
# write the headers on the spreadsheet
print('Starting to write outputs to Excel file create...')
now = datetime.datetime.now()
nowStr = now.strftime("%m/%d/%Y %H:%M:%S")
worksheet.write('A1', 'Starlink data from' + uriBase + " on " + nowStr)
worksheet.write('A3', 'NORAD_CAT_ID')
worksheet.write('B3', 'SATNAME')
worksheet.write('C3', 'EPOCH')
worksheet.write('D3', 'Orb')
worksheet.write('E3', 'Inc')
worksheet.write('F3', 'Ecc')
worksheet.write('G3', 'MnM')
worksheet.write('H3', 'ApA')
worksheet.write('I3', 'PeA')
worksheet.write('J3', 'AvA')
worksheet.write('K3', 'LAN')
worksheet.write('L3', 'AgP')
worksheet.write('M3', 'MnA')
worksheet.write('N3', 'SMa')
worksheet.write('O3', 'T')
worksheet.write('P3', 'Vel')
wsline = 3
def countdown(t, step=1, msg='Sleeping...'): # in seconds
pad_str = ' ' * len('%d' % step)
for i in range(t, 0, -step):
sys.stdout.write('{} for the next {} seconds {}\r'.format(msg, i, pad_str))
sys.stdout.flush()
time.sleep(step)
print('Done {} for {} seconds! {}'.format(msg, t, pad_str))
# use requests package to drive the RESTful session with space-track.org
print('Interfacing with SpaceTrack.org to obtain data...')
with requests.Session() as session:
# Need to log in first. NOTE: we get a 200 to say the web site got the data, not that we are logged in.
resp = session.post(uriBase + requestLogin, data=siteCred)
if resp.status_code != 200:
raise MyError(resp, "POST fail on login.")
# This query picks up all Starlink satellites from the catalog. NOTE: a 401 failure shows you have bad credentials.
resp = session.get(uriBase + requestCmdAction + requestFindStarlinks)
if resp.status_code != 200:
print(resp)
raise MyError(resp, "GET fail on request for resident space objects.")
# Use json package to break json-formatted response into a Python structure (a list of dictionaries)
retData = json.loads(resp.text)
satCount = len(retData)
satIds = []
for e in retData:
# each e describes the latest elements for one resident space object. We just need the NORAD_CAT_ID...
catId = e['NORAD_CAT_ID']
satIds.append(catId)
# Using our new list of resident space object NORAD_CAT_IDs, we can now get the OMM message
maxs = 1 # counter for number of sessions we have established without a pause in querying space-track.org
for s in satIds:
resp = session.get(uriBase + requestCmdAction + requestOMMStarlink1 + s + requestOMMStarlink2)
if resp.status_code != 200:
# If you are getting error 500's here, its probably the rate throttle on the site (20/min and 200/hr)
# wait a while and retry
print(resp)
raise MyError(resp, "GET fail on request for resident space object number " + s + '.')
# the data here can be quite large, as it's all the elements for every entry for one resident space object
retData = json.loads(resp.text)
for e in retData:
# each element is one reading of the orbital elements for one resident space object
print("Scanning satellite " + e['OBJECT_NAME'] + " at epoch " + e['EPOCH'] + '...')
mmoti = float(e['MEAN_MOTION'])
ecc = float(e['ECCENTRICITY'])
worksheet.write(wsline, 0, int(e['NORAD_CAT_ID']))
worksheet.write(wsline, 1, e['OBJECT_NAME'])
worksheet.write(wsline, 2, e['EPOCH'])
worksheet.write(wsline, 3, float(e['REV_AT_EPOCH']))
worksheet.write(wsline, 4, float(e['INCLINATION']), z1_format)
worksheet.write(wsline, 5, ecc, z3_format)
worksheet.write(wsline, 6, mmoti, z1_format)
# do some ninja-fu to flip Mean Motion into Apoapsis and Periapsis, and to get orbital period and velocity
sma = GM13 / ((TPI86 * mmoti) ** (2.0 / 3.0)) / 1000.0
apo = sma * (1.0 + ecc) - MRAD
per = sma * (1.0 - ecc) - MRAD
smak = sma * 1000.0
orbT = 2.0 * PI * ((smak ** 3.0) / GM) ** (0.5)
orbV = (GM / smak) ** (0.5)
worksheet.write(wsline, 7, apo, z1_format)
worksheet.write(wsline, 8, per, z1_format)
worksheet.write(wsline, 9, (apo + per) / 2.0, z1_format)
worksheet.write(wsline, 10, float(e['RA_OF_ASC_NODE']), z1_format)
worksheet.write(wsline, 11, float(e['ARG_OF_PERICENTER']), z1_format)
worksheet.write(wsline, 12, float(e['MEAN_ANOMALY']), z1_format)
worksheet.write(wsline, 13, sma, z1_format)
worksheet.write(wsline, 14, orbT, z0_format)
worksheet.write(wsline, 15, orbV, z0_format)
wsline = wsline + 1
maxs = maxs + 1
print(str(maxs))
if maxs > 18:
print('\nSnoozing for 60 secs for rate limit reasons (max 20/min and 200/hr).')
countdown(60)
maxs = 1
session.close()
workbook.close()
print('\nCompleted session.')
我想从下面的页面获取卫星位置,但我不确定抓取是否合适,因为该页面似乎每秒都在使用一些内部代码进行自我更新(它在我之后不断更新断开与互联网的连接)。可以在 Space Stackexchange 的问题中找到背景信息:A nicer way to download the positions of the Orbcomm-2 satellites.
我需要 "snapshot" 四件商品同时:
- UTC 时间
- 纬度
- 经度
- 海拔高度
现在我使用屏幕截图和手动输入。由于这些值正在由页面更新 - 传统的网络抓取是否可以在这里工作?我找到了一个 "screen-scraping" 标签,我应该尝试了解它吗?
我正在寻找获得这四个值的最简单的解决方案,我想知道我是否可以只使用 urllib
或 urllib2
而避免安装新的东西?
示例页面:http://www.satview.org/?sat_id=41186U我需要完成 41179U 到 41189U(SpaceX 刚刚送入轨道的 11 颗 Orbcomm-2 卫星)
一个选择是启动真正的浏览器并在无限循环中不断轮询位置:
import time
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.satview.org/?sat_id=41186U")
while True:
location = driver.find_element_by_css_selector("#sat_latlon .texto_track2").text
latitude, longitude = location.split("\n")[:2]
print(latitude, longitude)
time.sleep(1)
示例输出:
(u'-16.57', u'66.63')
(u'-16.61', u'66.67')
...
这里我们使用selenium
and Firefox - there are multiple drivers for different browsers including headless, like PhantomJS
.
不需要刮。只需查看该页面的来源 html 和 copy/paste javascript 代码。 None 的位置是远程获取的...它们都是在页面中动态计算的。因此,只需获取代码并 运行 自己!
Space-Track.org 的 REST API 似乎是为处理此类请求而构建的。在那里拥有帐户后,您甚至可以下载示例脚本(在此处更新)来下载 TLES:
# STTest.py
#
# Simple Python app to extract resident space object history data from www.space-track.org into a spreadsheet
# (prior to executing, register for a free personal account at https://www.space-track.org/auth/createAccount)
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free Software Foundation,
# either version 3 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
# without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# See the GNU General Public License for more details.
#
# For full licencing terms, please refer to the GNU General Public License (gpl-3_0.txt) distributed with this release,
# or see http://www.gnu.org/licenses/gpl-3.0.html.
import requests
import json
import xlsxwriter
import time
import datetime
import getpass
import sys
class MyError(Exception):
def __init___(self, args):
Exception.__init__(self, "my exception was raised with arguments {0}".format(args))
self.args = args
# See https://www.space-track.org/documentation for details on REST queries
# "Find Starlinks" query finds all satellites w/ NORAD_CAT_ID > 40000 & OBJECT_NAME matching STARLINK*, 1 line per sat;
# the "OMM Starlink" query gets all Orbital Mean-Elements Messages (OMM) for a specific NORAD_CAT_ID in JSON format.
uriBase = "https://www.space-track.org"
requestLogin = "/ajaxauth/login"
requestCmdAction = "/basicspacedata/query"
requestFindStarlinks = "/class/tle_latest/NORAD_CAT_ID/>40000/ORDINAL/1/OBJECT_NAME/STARLINK~~/format/json/orderby/NORAD_CAT_ID%20asc"
requestOMMStarlink1 = "/class/omm/NORAD_CAT_ID/"
requestOMMStarlink2 = "/orderby/EPOCH%20asc/format/json"
# Parameters to derive apoapsis and periapsis from mean motion (see https://en.wikipedia.org/wiki/Mean_motion)
GM = 398600441800000.0
GM13 = GM ** (1.0 / 3.0)
MRAD = 6378.137
PI = 3.14159265358979
TPI86 = 2.0 * PI / 86400.0
# Log in to personal account obtained by registering for free at https://www.space-track.org/auth/createAccount
print('\nEnter your personal Space-Track.org username (usually your email address for registration): ')
configUsr = input()
print('Username capture complete.\n')
configPwd = getpass.getpass(prompt='Securely enter your Space-Track.org password (minimum of 15 characters): ')
# Excel Output file name - e.g. starlink-track.xlsx (note: make it an .xlsx file)
configOut = 'STText.xslx'
siteCred = {'identity': configUsr, 'password': configPwd}
# User xlsxwriter package to write the .xlsx file
print('Creating Microsoft Excel (.xlsx) file to contain outputs...')
workbook = xlsxwriter.Workbook(configOut)
worksheet = workbook.add_worksheet()
z0_format = workbook.add_format({'num_format': '#,##0'})
z1_format = workbook.add_format({'num_format': '#,##0.0'})
z2_format = workbook.add_format({'num_format': '#,##0.00'})
z3_format = workbook.add_format({'num_format': '#,##0.000'})
# write the headers on the spreadsheet
print('Starting to write outputs to Excel file create...')
now = datetime.datetime.now()
nowStr = now.strftime("%m/%d/%Y %H:%M:%S")
worksheet.write('A1', 'Starlink data from' + uriBase + " on " + nowStr)
worksheet.write('A3', 'NORAD_CAT_ID')
worksheet.write('B3', 'SATNAME')
worksheet.write('C3', 'EPOCH')
worksheet.write('D3', 'Orb')
worksheet.write('E3', 'Inc')
worksheet.write('F3', 'Ecc')
worksheet.write('G3', 'MnM')
worksheet.write('H3', 'ApA')
worksheet.write('I3', 'PeA')
worksheet.write('J3', 'AvA')
worksheet.write('K3', 'LAN')
worksheet.write('L3', 'AgP')
worksheet.write('M3', 'MnA')
worksheet.write('N3', 'SMa')
worksheet.write('O3', 'T')
worksheet.write('P3', 'Vel')
wsline = 3
def countdown(t, step=1, msg='Sleeping...'): # in seconds
pad_str = ' ' * len('%d' % step)
for i in range(t, 0, -step):
sys.stdout.write('{} for the next {} seconds {}\r'.format(msg, i, pad_str))
sys.stdout.flush()
time.sleep(step)
print('Done {} for {} seconds! {}'.format(msg, t, pad_str))
# use requests package to drive the RESTful session with space-track.org
print('Interfacing with SpaceTrack.org to obtain data...')
with requests.Session() as session:
# Need to log in first. NOTE: we get a 200 to say the web site got the data, not that we are logged in.
resp = session.post(uriBase + requestLogin, data=siteCred)
if resp.status_code != 200:
raise MyError(resp, "POST fail on login.")
# This query picks up all Starlink satellites from the catalog. NOTE: a 401 failure shows you have bad credentials.
resp = session.get(uriBase + requestCmdAction + requestFindStarlinks)
if resp.status_code != 200:
print(resp)
raise MyError(resp, "GET fail on request for resident space objects.")
# Use json package to break json-formatted response into a Python structure (a list of dictionaries)
retData = json.loads(resp.text)
satCount = len(retData)
satIds = []
for e in retData:
# each e describes the latest elements for one resident space object. We just need the NORAD_CAT_ID...
catId = e['NORAD_CAT_ID']
satIds.append(catId)
# Using our new list of resident space object NORAD_CAT_IDs, we can now get the OMM message
maxs = 1 # counter for number of sessions we have established without a pause in querying space-track.org
for s in satIds:
resp = session.get(uriBase + requestCmdAction + requestOMMStarlink1 + s + requestOMMStarlink2)
if resp.status_code != 200:
# If you are getting error 500's here, its probably the rate throttle on the site (20/min and 200/hr)
# wait a while and retry
print(resp)
raise MyError(resp, "GET fail on request for resident space object number " + s + '.')
# the data here can be quite large, as it's all the elements for every entry for one resident space object
retData = json.loads(resp.text)
for e in retData:
# each element is one reading of the orbital elements for one resident space object
print("Scanning satellite " + e['OBJECT_NAME'] + " at epoch " + e['EPOCH'] + '...')
mmoti = float(e['MEAN_MOTION'])
ecc = float(e['ECCENTRICITY'])
worksheet.write(wsline, 0, int(e['NORAD_CAT_ID']))
worksheet.write(wsline, 1, e['OBJECT_NAME'])
worksheet.write(wsline, 2, e['EPOCH'])
worksheet.write(wsline, 3, float(e['REV_AT_EPOCH']))
worksheet.write(wsline, 4, float(e['INCLINATION']), z1_format)
worksheet.write(wsline, 5, ecc, z3_format)
worksheet.write(wsline, 6, mmoti, z1_format)
# do some ninja-fu to flip Mean Motion into Apoapsis and Periapsis, and to get orbital period and velocity
sma = GM13 / ((TPI86 * mmoti) ** (2.0 / 3.0)) / 1000.0
apo = sma * (1.0 + ecc) - MRAD
per = sma * (1.0 - ecc) - MRAD
smak = sma * 1000.0
orbT = 2.0 * PI * ((smak ** 3.0) / GM) ** (0.5)
orbV = (GM / smak) ** (0.5)
worksheet.write(wsline, 7, apo, z1_format)
worksheet.write(wsline, 8, per, z1_format)
worksheet.write(wsline, 9, (apo + per) / 2.0, z1_format)
worksheet.write(wsline, 10, float(e['RA_OF_ASC_NODE']), z1_format)
worksheet.write(wsline, 11, float(e['ARG_OF_PERICENTER']), z1_format)
worksheet.write(wsline, 12, float(e['MEAN_ANOMALY']), z1_format)
worksheet.write(wsline, 13, sma, z1_format)
worksheet.write(wsline, 14, orbT, z0_format)
worksheet.write(wsline, 15, orbV, z0_format)
wsline = wsline + 1
maxs = maxs + 1
print(str(maxs))
if maxs > 18:
print('\nSnoozing for 60 secs for rate limit reasons (max 20/min and 200/hr).')
countdown(60)
maxs = 1
session.close()
workbook.close()
print('\nCompleted session.')