Can someone explain why .asp links are giving weird behavior?


By the way, if you go directly to this link, you get a blank page (I don't know why). Also, if I now click a file, it opens like this

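There is a pattern in the links: only the report number changes from one report to the next and the rest of the link stays the same, so I scraped all the links. The problem I am facing is that when I load a link with BeautifulSoup, I don't get the same content for that report as when I load the same link in the browser.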
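import bs4 as bs
import requests

# Assumed base URL for the relative report links; the original post did not show this value
string1 = 'http://delhigovt.nic.in/FireReport/'

# Parse the locally saved search-results page
with open("p.html", 'r') as f:
    page = f.read()
soup = bs.BeautifulSoup(page, 'lxml')
links = soup.find_all('a')

# Collect the href of every anchor that has one
urls = []
for link in links:
    if link.get('href'):
        urls.append(link['href'])

# print(urls)
link1 = string1 + urls[1]

# Plain GET without a session: this is the request that returns different content
sauce = requests.get(link1)
soup = bs.BeautifulSoup(sauce.content, 'lxml')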

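It is random: sometimes, if I copy a link and load it in a new tab (or another browser), it turns into an error page, so I lose the report information. I can't scrape the data this way, even though I have all the links for all the reports. Can someone tell me what is going on? Thanks.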

Update - the link is http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public. You have to select "No" in the top right corner to get the "Search" button.

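To scrape the page, you need to use requests.session so that the cookies are set correctly. The POST request also carries a parameter ud, which the page uses and which must be set correctly.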


import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public'
post_url = 'http://delhigovt.nic.in/FireReport/a_publicSearch.asp'

params = {'ud': '',
          'fstation': '',
          'caller': '',
          'add': '',
          'frmdate': '',
          'todate': '',
          'save': 'Search'}

def open_report(s, url):
    url = 'http://delhigovt.nic.in/FireReport/' + url
    soup = BeautifulSoup(s.get(url).content, 'lxml')

    # just return some text here
    return soup.select('body > table')[1].get_text(strip=True, separator=' ')

data = {}
with requests.session() as s:
    soup = BeautifulSoup(s.get(url).content, 'lxml')

    stations = {}
    for option in soup.select('select[name="fstation"] option[value]:not(:contains("Select Fire Station"))'):
        stations[option.get_text(strip=True)] = option['value']

    params['ud'] = soup.select_one('input[name="ud"][value]')['value']

    for k, v in stations.items():
        print('Scraping station {} id={}'.format(k, v))

        params['fstation'] = int(v)
        soup = BeautifulSoup( s.post(post_url, data=params).content, 'lxml' )

        for tr in soup.select('tr:has(> td > a[href^="f_publicReport.asp?rep_no="])'):
            no, fire_report_no, date, address = tr.select('td')
            link = fire_report_no.a['href']

            data.setdefault(k, [])
            data[k].append( (no.get_text(strip=True), fire_report_no.get_text(strip=True), date.get_text(strip=True), address.get_text(strip=True), link, open_report(s, link)) )

# pprint(data) # <-- here is your data


Scraping station Badli id=33
 'Shahbad, Daulat Pur.',
 'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
 'Number&nbsp&nbsp: 200600024 Operational Jurisdiction of Fire Station&nbsp: '
 'Badli Information Received From: PCR Full Address of Incident Place: '
 'Shahbad, Daulat Pur. Date of Receipt of Call&nbsp: Saturday, April 1, 2006 '
 'Time of Receipt of Call \t&nbsp: 17\xa0Hrs\xa0:\xa055\xa0Min Time of '
 'Departure From Fire Station: 17\xa0Hrs\xa0:\xa056\xa0Min Approximate '
 'Distance From Fire Station: 3\xa0\xa0Kilometers Time of Arrival at Fire '
 'Scene: 17\xa0Hrs\xa0:\xa059\xa0Min Nature of Call Fire Date of Leaving From '
 'Fire Scene: 4/1/2006 Time of Leaving From Fire Scene: 18\xa0Hrs\xa0:\xa0'
 '30\xa0Min Type of Occupancy: Others Occupancy Details in Case of Others: '
 'NDPL Category of Fire: Small Type of Building: Low Rise Details of Affected '
 'Area: Fire was in electrical wiring. Divisional Officer Delhi Fire Service '
 'Disclaimer: This is a computer generated report.\r\n'
 'Neither department nor its associates, information providers or content '
 'providers warrant or guarantee the timeliness, sequence, accuracy or '
 'completeness of this information.')
 'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi.',
 'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
 'Number&nbsp&nbsp: 200600161 Operational Jurisdiction of Fire Station&nbsp: '
 'Badli Information Received From: PCR Full Address of Incident Place: '
 'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi. Date of '
 'Receipt of Call&nbsp: Wednesday, April 5, 2006 Time of Receipt of Call \t'
 '&nbsp: 19\xa0Hrs\xa0:\xa010\xa0Min Time of Departure From Fire Station: '
 '19\xa0Hrs\xa0:\xa011\xa0Min Approximate Distance From Fire Station: '
 '1.5\xa0\xa0Kilometers Time of Arrival at Fire Scene: 19\xa0Hrs\xa0:\xa013\xa0'
 'Min Nature of Call Fire Date of Leaving From Fire Scene: 4/5/2006 Time of '
 'Leaving From Fire Scene: 20\xa0Hrs\xa0:\xa050\xa0Min Type of Occupancy: '
 'Others Occupancy Details in Case of Others: MCD Category of Fire: Small Type '
 'of Building: Others Building Details in Case of Others: On Road Details of '
 'Affected Area: Fire was in Rubbish and dry tree on road. Divisional Officer '
 'Delhi Fire Service Disclaimer: This is a computer generated report.\r\n'
 'Neither department nor its associates, information providers or content '
 'providers warrant or guarantee the timeliness, sequence, accuracy or '
 'completeness of this information.')

...and so on.
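
The report text above still contains raw `&nbsp` entities and `\xa0` non-breaking spaces. If you want plain text, one possible cleanup step is sketched below. The helper name `clean_report_text` and the sample string are my own illustration, not part of the site or of the code above; it only uses the standard library.

```python
import html


def clean_report_text(text):
    """Decode leftover HTML entities and collapse runs of whitespace."""
    # html.unescape also decodes entities that lack the trailing
    # semicolon, such as the '&nbsp' fragments seen in the output
    decoded = html.unescape(text)
    # str.split() with no arguments splits on any Unicode whitespace,
    # including '\xa0', tabs and '\r\n', so joining with a single space
    # normalizes all of them
    return ' '.join(decoded.split())


# Hypothetical sample assembled from fragments of the output above
sample = ('Number&nbsp&nbsp: 200600024 Time of Receipt of Call \t&nbsp: '
          '17\xa0Hrs\xa0:\xa055\xa0Min')
print(clean_report_text(sample))
```

You could apply it to the value returned by `open_report` before storing it in `data`.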