Can someone explain why .asp links are giving weird behavior?
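I know very little about backend programming. I want to scrape data from the Delhi Fire Service for my academic project. Online fire reports are available zone by zone for Delhi, and each zone has a large number of files available.
By the way, if you go directly to this link, you will get a blank page (I don't know why). Further, if I now click on a file, it opens like this.
There is a pattern in the link: only the report number changes each time and the rest of the link stays the same, so I collected all the links for scraping. The problem I'm facing is that when I load a link with BeautifulSoup, I don't get the same report content that I get when I load the same link in the browser.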
import bs4 as bs
import urllib.request
import requests

with open("p.html", 'r') as f:
    page = f.read()
soup = bs.BeautifulSoup(page, 'lxml')

# collect the href of every <a> tag in the saved page
links = soup.find_all('a')
urls = []
for link in links:
    urls.append(link.get('href'))

string1 = "http://delhigovt.nic.in/FireReport/"
# print(urls)
link1 = string1 + urls[1]
print(link1)

# plain GET without any session or cookies
sauce = requests.get(link1)
soup = bs.BeautifulSoup(sauce.content, 'lxml')
print(soup)
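It's random: sometimes, if I copy a link and load it in a new tab (or another browser), it turns into an error page and I lose the report information. I can't scrape the data this way, even though I have the links to all the reports. Can someone tell me what is going on? Thank you.
Update - link: http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public
You have to select "No" in the top right corner to get the "Search" button.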
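To scrape the pages, you need to use requests.session so that the cookies are set correctly. There is also a parameter ud in the POST request that the page uses, and it needs to be set correctly as well.
For example (this scrapes all stations and reports and stores them in the dictionary data):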
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public'
post_url = 'http://delhigovt.nic.in/FireReport/a_publicSearch.asp'

params = {'ud': '',
          'fstation': '',
          'caller': '',
          'add': '',
          'frmdate': '',
          'todate': '',
          'save': 'Search'}

def open_report(s, url):
    url = 'http://delhigovt.nic.in/FireReport/' + url
    print(url)
    soup = BeautifulSoup(s.get(url).content, 'lxml')
    # just return some text here
    return soup.select('body > table')[1].get_text(strip=True, separator=' ')

data = {}

with requests.session() as s:
    # the first GET sets the session cookies and returns the search form
    soup = BeautifulSoup(s.get(url).content, 'lxml')

    # map station name -> station id, skipping the placeholder option
    # (newer soupsieve releases prefer :-soup-contains over :contains)
    stations = {}
    for option in soup.select('select[name="fstation"] option[value]:not(:contains("Select Fire Station"))'):
        stations[option.get_text(strip=True)] = option['value']

    # read the ud value from the form's input and send it back with every POST
    params['ud'] = soup.select_one('input[name="ud"][value]')['value']

    for k, v in stations.items():
        print('Scraping station {} id={}'.format(k, v))
        params['fstation'] = int(v)
        soup = BeautifulSoup(s.post(post_url, data=params).content, 'lxml')

        # every result row that contains a link to a report
        for tr in soup.select('tr:has(> td > a[href^="f_publicReport.asp?rep_no="])'):
            no, fire_report_no, date, address = tr.select('td')
            link = fire_report_no.a['href']

            data.setdefault(k, [])
            data[k].append((no.get_text(strip=True),
                            fire_report_no.get_text(strip=True),
                            date.get_text(strip=True),
                            address.get_text(strip=True),
                            link,
                            open_report(s, link)))
            pprint(data[k][-1])

# pprint(data) # <-- here is your data
Prints:
Scraping station Badli id=33
http://delhigovt.nic.in/FireReport/f_publicReport.asp?rep_no=200600024&ud=6668
('1',
'200600024',
'1-Apr-2006',
'Shahbad, Daulat Pur.',
'f_publicReport.asp?rep_no=200600024&ud=6668',
'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
'Number  : 200600024 Operational Jurisdiction of Fire Station : '
'Badli Information Received From: PCR Full Address of Incident Place: '
'Shahbad, Daulat Pur. Date of Receipt of Call : Saturday, April 1, 2006 '
'Time of Receipt of Call \t : 17\xa0Hrs\xa0:\xa055\xa0Min Time of '
'Departure From Fire Station: 17\xa0Hrs\xa0:\xa056\xa0Min Approximate '
'Distance From Fire Station: 3\xa0\xa0Kilometers Time of Arrival at Fire '
'Scene: 17\xa0Hrs\xa0:\xa059\xa0Min Nature of Call Fire Date of Leaving From '
'Fire Scene: 4/1/2006 Time of Leaving From Fire Scene: 18\xa0Hrs\xa0:\xa0'
'30\xa0Min Type of Occupancy: Others Occupancy Details in Case of Others: '
'NDPL Category of Fire: Small Type of Building: Low Rise Details of Affected '
'Area: Fire was in electrical wiring. Divisional Officer Delhi Fire Service '
'Disclaimer: This is a computer generated report.\r\n'
'Neither department nor its associates, information providers or content '
'providers warrant or guarantee the timeliness, sequence, accuracy or '
'completeness of this information.')
http://delhigovt.nic.in/FireReport/f_publicReport.asp?rep_no=200600161&ud=6668
('2',
'200600161',
'5-Apr-2006',
'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi.',
'f_publicReport.asp?rep_no=200600161&ud=6668',
'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
'Number  : 200600161 Operational Jurisdiction of Fire Station : '
'Badli Information Received From: PCR Full Address of Incident Place: '
'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi. Date of '
'Receipt of Call : Wednesday, April 5, 2006 Time of Receipt of Call \t'
' : 19\xa0Hrs\xa0:\xa010\xa0Min Time of Departure From Fire Station: '
'19\xa0Hrs\xa0:\xa011\xa0Min Approximate Distance From Fire Station: '
'1.5\xa0\xa0Kilometers Time of Arrival at Fire Scene: 19\xa0Hrs\xa0:\xa013\xa0'
'Min Nature of Call Fire Date of Leaving From Fire Scene: 4/5/2006 Time of '
'Leaving From Fire Scene: 20\xa0Hrs\xa0:\xa050\xa0Min Type of Occupancy: '
'Others Occupancy Details in Case of Others: MCD Category of Fire: Small Type '
'of Building: Others Building Details in Case of Others: On Road Details of '
'Affected Area: Fire was in Rubbish and dry tree on road. Divisional Officer '
'Delhi Fire Service Disclaimer: This is a computer generated report.\r\n'
'Neither department nor its associates, information providers or content '
'providers warrant or guarantee the timeliness, sequence, accuracy or '
'completeness of this information.')
...and so on.
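Each stored tuple keeps the relative report link and the raw report text, which in the output above still contains non-breaking spaces (\xa0) and stray tabs. As a rough sketch of post-processing one record, the helpers below split a link into its rep_no and ud query parameters and normalize the text. The names BASE_URL, split_report_link and clean_text are made up for this illustration and are not part of the answer's code. That ud belongs to the session which produced the link is only an assumption, based on the remark above that ud has to be set correctly.

```python
from urllib.parse import urljoin, urlsplit, parse_qs

# Base path taken from the URLs printed above
BASE_URL = 'http://delhigovt.nic.in/FireReport/'

def split_report_link(href):
    """Return (absolute_url, rep_no, ud) for a relative report link."""
    absolute = urljoin(BASE_URL, href)
    query = parse_qs(urlsplit(absolute).query)
    # parse_qs maps each parameter name to a list of values
    return absolute, query['rep_no'][0], query['ud'][0]

def clean_text(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    return ' '.join(text.replace('\xa0', ' ').split())

if __name__ == '__main__':
    href = 'f_publicReport.asp?rep_no=200600024&ud=6668'
    print(split_report_link(href))
    print(clean_text('17\xa0Hrs\xa0:\xa055\xa0Min Time of'))
```

Keeping rep_no as a separate field may make it easier to key the stored reports by report number instead of by the full link, since the ud part of the link might differ between runs.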