Can someone explain why .asp links behave strangely?

Question

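I know very little about backend programming. For an academic project I want to scrape data from the Delhi Fire Service, which publishes fire reports for the Delhi region online. Each zone has a large number of files available.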

By the way, if you go to this link directly you get a blank page (I don't know why). Also, if I now click on a file, it opens like this

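There is a pattern in the links: only the report number changes from one report to the next, and the rest of the link stays the same, so I was able to collect all the links to scrape. The problem I am facing is that when I load a link with BeautifulSoup, I do not get the same report content that I get when I load the same link in a browser.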

import bs4 as bs
import requests

# read the locally saved page that contains the report links
with open("p.html", 'r') as f:
    page = f.read()
soup = bs.BeautifulSoup(page, 'lxml')

# collect the href of every link on that page
urls = [link.get('href') for link in soup.find_all('a')]

string1 = "http://delhigovt.nic.in/FireReport/"

# print(urls)
link1 = string1 + urls[1]
print(link1)

# fetch the report with a plain GET (no session, no cookies)
sauce = requests.get(link1)
soup = bs.BeautifulSoup(sauce.content, 'lxml')
print(soup)

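The behavior is random: sometimes, if I copy a link and load it in a new tab (or in another browser), it turns into an error page and I lose the report information. I cannot scrape the data this way, even though I have the links to all the reports. Can someone tell me what is going on? Thank you.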

Update: the link is http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public. You have to choose "No" in the select box at the top right to get the "Search" button.

Answer 1

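To scrape the pages, you need to use requests.session so that the cookies are set correctly. The POST request also carries a parameter ud, which the page uses and which needs to be set correctly.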
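To illustrate why the cookies matter, here is a minimal offline sketch. It does not talk to the real site: ToyHandler is a hypothetical stand-in that hands out a cookie on /search and only serves /report when that cookie comes back. The paths, the cookie name sid and the page texts are all assumptions made up for the demo. It uses only the standard library so that it runs without network access; requests.session plays the same role as the cookie-aware opener here.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in: /search sets a cookie, /report requires it."""

    def do_GET(self):
        has_cookie = 'sid=abc' in (self.headers.get('Cookie') or '')
        if self.path == '/search':
            body = b'search page'
        elif has_cookie:
            body = b'report text'
        else:
            body = b'error page'
        self.send_response(200)
        if self.path == '/search':
            self.send_header('Set-Cookie', 'sid=abc')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # silence the default per-request logging
        pass


def fetch(opener, url):
    """GET url with the given opener and return the body as text."""
    with opener.open(url) as resp:
        return resp.read().decode()


# start the toy server on a free local port
server = HTTPServer(('127.0.0.1', 0), ToyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:{}'.format(server.server_port)

no_proxy = urllib.request.ProxyHandler({})  # ignore any proxy settings

# No cookie handling: similar to pasting the link into a fresh browser.
plain = urllib.request.build_opener(no_proxy)
without_session = fetch(plain, base + '/report')

# Cookie-aware opener: visit the search page first, then the report.
with_cookies = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(CookieJar()))
fetch(with_cookies, base + '/search')
with_session = fetch(with_cookies, base + '/report')

server.shutdown()
server.server_close()

print(without_session)
print(with_session)
```

In this toy setup the first request yields the error page and the second yields the report, because only the second opener sends back the cookie it received from /search. Whether the real site decides in exactly this way is not confirmed here; the sketch only shows the general mechanism.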

For example, the following script scrapes all stations and reports and stores them in the dictionary data:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public'
post_url = 'http://delhigovt.nic.in/FireReport/a_publicSearch.asp'

params = {'ud': '',
          'fstation': '',
          'caller': '',
          'add': '',
          'frmdate': '',
          'todate': '',
          'save': 'Search'}

def open_report(s, url):
    url = 'http://delhigovt.nic.in/FireReport/' + url
    print(url)
    soup = BeautifulSoup(s.get(url).content, 'lxml')

    # just return some text here
    return soup.select('body > table')[1].get_text(strip=True, separator=' ')

data = {}
with requests.session() as s:
    soup = BeautifulSoup(s.get(url).content, 'lxml')

    stations = {}
    for option in soup.select('select[name="fstation"] option[value]:not(:contains("Select Fire Station"))'):
        stations[option.get_text(strip=True)] = option['value']

    params['ud'] = soup.select_one('input[name="ud"][value]')['value']

    for k, v in stations.items():
        print('Scraping station {} id={}'.format(k, v))

        params['fstation'] = int(v)
        soup = BeautifulSoup( s.post(post_url, data=params).content, 'lxml' )

        for tr in soup.select('tr:has(> td > a[href^="f_publicReport.asp?rep_no="])'):
            no, fire_report_no, date, address = tr.select('td')
            link = fire_report_no.a['href']

            data.setdefault(k, [])
            data[k].append( (no.get_text(strip=True), fire_report_no.get_text(strip=True), date.get_text(strip=True), address.get_text(strip=True), link, open_report(s, link)) )
            pprint(data[k][-1])

# pprint(data) # <-- here is your data

Prints:

Scraping station Badli id=33
http://delhigovt.nic.in/FireReport/f_publicReport.asp?rep_no=200600024&ud=6668
('1',
 '200600024',
 '1-Apr-2006',
 'Shahbad, Daulat Pur.',
 'f_publicReport.asp?rep_no=200600024&ud=6668',
 'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
 'Number&nbsp&nbsp: 200600024 Operational Jurisdiction of Fire Station&nbsp: '
 'Badli Information Received From: PCR Full Address of Incident Place: '
 'Shahbad, Daulat Pur. Date of Receipt of Call&nbsp: Saturday, April 1, 2006 '
 'Time of Receipt of Call \t&nbsp: 17\xa0Hrs\xa0:\xa055\xa0Min Time of '
 'Departure From Fire Station: 17\xa0Hrs\xa0:\xa056\xa0Min Approximate '
 'Distance From Fire Station: 3\xa0\xa0Kilometers Time of Arrival at Fire '
 'Scene: 17\xa0Hrs\xa0:\xa059\xa0Min Nature of Call Fire Date of Leaving From '
 'Fire Scene: 4/1/2006 Time of Leaving From Fire Scene: 18\xa0Hrs\xa0:\xa0'
 '30\xa0Min Type of Occupancy: Others Occupancy Details in Case of Others: '
 'NDPL Category of Fire: Small Type of Building: Low Rise Details of Affected '
 'Area: Fire was in electrical wiring. Divisional Officer Delhi Fire Service '
 'Disclaimer: This is a computer generated report.\r\n'
 'Neither department nor its associates, information providers or content '
 'providers warrant or guarantee the timeliness, sequence, accuracy or '
 'completeness of this information.')
http://delhigovt.nic.in/FireReport/f_publicReport.asp?rep_no=200600161&ud=6668
('2',
 '200600161',
 '5-Apr-2006',
 'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi.',
 'f_publicReport.asp?rep_no=200600161&ud=6668',
 'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
 'Number&nbsp&nbsp: 200600161 Operational Jurisdiction of Fire Station&nbsp: '
 'Badli Information Received From: PCR Full Address of Incident Place: '
 'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi. Date of '
 'Receipt of Call&nbsp: Wednesday, April 5, 2006 Time of Receipt of Call \t'
 '&nbsp: 19\xa0Hrs\xa0:\xa010\xa0Min Time of Departure From Fire Station: '
 '19\xa0Hrs\xa0:\xa011\xa0Min Approximate Distance From Fire Station: '
 '1.5\xa0\xa0Kilometers Time of Arrival at Fire Scene: 19\xa0Hrs\xa0:\xa013\xa0'
 'Min Nature of Call Fire Date of Leaving From Fire Scene: 4/5/2006 Time of '
 'Leaving From Fire Scene: 20\xa0Hrs\xa0:\xa050\xa0Min Type of Occupancy: '
 'Others Occupancy Details in Case of Others: MCD Category of Fire: Small Type '
 'of Building: Others Building Details in Case of Others: On Road Details of '
 'Affected Area: Fire was in Rubbish and dry tree on road. Divisional Officer '
 'Delhi Fire Service Disclaimer: This is a computer generated report.\r\n'
 'Neither department nor its associates, information providers or content '
 'providers warrant or guarantee the timeliness, sequence, accuracy or '
 'completeness of this information.')

...and so on.
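Note that every report link in the output carries its own ud query parameter (ud=6668 in this run). If that value differs between visits, which is an assumption and not something the output confirms, links saved from an earlier visit would carry a stale value. A minimal sketch of swapping in a fresh value follows; refresh_report_link is a hypothetical helper and '1234' is a made-up token used only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

BASE_URL = 'http://delhigovt.nic.in/FireReport/'


def refresh_report_link(href, current_ud, base_url=BASE_URL):
    """Return an absolute report URL with its ud parameter set to current_ud."""
    parts = urlsplit(urljoin(base_url, href))
    # keep every query parameter except the old ud, then append the new one
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'ud']
    query.append(('ud', str(current_ud)))
    return urlunsplit(parts._replace(query=urlencode(query)))


stale = 'f_publicReport.asp?rep_no=200600024&ud=6668'
print(refresh_report_link(stale, '1234'))
```

The current value would come from the search page within the same session, the way the script above reads input[name="ud"], and the rewritten URL would then be fetched with that same session.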


python

beautifulsoup

web-development-server

web-scraping