Web scraping a college timetable keeps giving me a 'charmap' error

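I'm trying to web-scrape a timetable for a project and find the free time slots. When I run this code, I get the following error: "'charmap' codec can't decode byte 0x8d in position 32078: character maps to <undefined>". Here is my code: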

import requests

url = "https://opentimetable.dcu.ie/"
response = requests.get(url)

with open('webpage.html', 'r') as html_file:
    content = html_file.read()

Any help is appreciated.

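The reason you're not getting anything is that this data is rendered dynamically. You need to select different parameters to query what you're after; it won't show up in a simple static request.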
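A note on the error message itself, which is separate from the dynamic-rendering issue: a 'charmap' decode error typically appears on Windows when `open()` is called without an `encoding` argument, so Python falls back to the locale's code page (often cp1252), where byte 0x8d has no mapping. Below is a minimal sketch of the usual fix, assuming the saved page is UTF-8 encoded. The helper name `read_html` and the sample text are illustrative and not part of the original code.

```python
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Read a saved HTML file with an explicit encoding (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as html_file:
        return html_file.read()


# Illustrative sample: "č" is stored in UTF-8 as the bytes 0xC4 0x8D.
sample = "<p>č</p>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "webpage.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading with the encoding the file was written in works.
    content = read_html(path)

    # Reading the same bytes as cp1252 fails, because 0x8d is undefined there.
    try:
        read_html(path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError as err:
        cp1252_failed = True
        print(err)
```

Passing `encoding="utf-8"` explicitly makes the read independent of the machine's locale, which is why the same script can work on one computer and fail on another.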

The site does have an API that lets you search by different categories. The category with the fewest unique values is "Location", so I went with that.

The code below gets all the location IDs, then feeds them into the filter to find what is booked at each location and at what time.

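The resulting table has the start and end times (and dates) of the bookings. I'll leave it to you to parse that information to find the dates or times that are open. What I would do is have Python create a list of all the unique dates with their start and end times, sort it, and then find where there are gaps/no overlap.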

import requests
import pandas as pd

s = requests.Session()

url = "https://opentimetable.dcu.ie/broker/api/categoryTypeOptions"
s.get(url)
cookies = s.cookies.get_dict()

cookieStr = ''
for k, v in cookies.items():
    cookieStr += f'{k}={v};'

headers = {
'Accept': 'application/json, text/plain, */*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,en-GB;q=0.8',
'Authorization': 'basic T64Mdy7m[',
'Connection': 'keep-alive',
'Content-Type': 'application/json',
'Cookie': cookieStr,
'Host': 'opentimetable.dcu.ie',
'Referer': 'https://opentimetable.dcu.ie/',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}



        
    
url = 'https://opentimetable.dcu.ie/broker/api/CategoryTypes/1e042cb1-547d-41d4-ae93-a1f2c3d34538/Categories/Filter?pageNumber=1'
payload = '[{"Identity": "6359fd0c-1bbe-496a-8998-4fefc5cd18de","Values": ["null"]}]' 
jsonData = s.post(url, headers=headers, data=payload).json()

totalPages = jsonData['TotalPages']

print('Page: 1 of %s' %totalPages)
locationList = jsonData['Results']

for page in range(2, totalPages+1):
    print('Page: %s of %s' %(page,totalPages))
    url = 'https://opentimetable.dcu.ie/broker/api/CategoryTypes/1e042cb1-547d-41d4-ae93-a1f2c3d34538/Categories/Filter?pageNumber=%s' %(page)
    jsonData = s.post(url, headers=headers, data=payload).json()
    locationList += jsonData['Results']

locationList = [x['Identity'] for x in locationList]


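def update_payload(listOfLocations):
    # Use real booleans so that json=payload serializes them as JSON true/false
    # rather than as the strings "true"/"false".
    true = True
    false = False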
    
    payload = {
      "ViewOptions": {
        "Days": [
          {
            "Name": "Monday",
            "DayOfWeek": 1,
            "IsDefault": true
          },
          {
            "Name": "Tuesday",
            "DayOfWeek": 2,
            "IsDefault": true
          },
          {
            "Name": "Wednesday",
            "DayOfWeek": 3,
            "IsDefault": true
          },
          {
            "Name": "Thursday",
            "DayOfWeek": 4,
            "IsDefault": true
          },
          {
            "Name": "Friday",
            "DayOfWeek": 5,
            "IsDefault": true
          }
        ],
        "Weeks": [
          {
            "WeekNumber": 21,
            "WeekLabel": "21",
            "FirstDayInWeek": "2022-02-07T00:00:00.000Z"
          }
        ],
        "TimePeriods": [
          {
            "Description": "All Day",
            "StartTime": "08:00",
            "EndTime": "22:00",
            "IsDefault": true
          }
        ],
        "DatePeriods": [
          {
            "Description": "This Week",
            "StartDateTime": "2021-09-20T00:00:00.000Z",
            "EndDateTime": "2022-09-20T00:00:00.000Z",
            "IsDefault": true,
            "IsThisWeek": true,
            "IsNextWeek": false,
            "Type": "ThisWeek"
          }
        ],
        "LegendItems": [],
        "InstitutionConfig": {},
        "DateConfig": {
          "FirstDayInWeek": 1,
          "StartDate": "2021-09-20T00:00:00+00:00",
          "EndDate": "2022-09-20T00:00:00+00:00"
        },
        "AllDays": [
          {
            "Name": "Monday",
            "DayOfWeek": 1,
            "IsDefault": true
          },
          {
            "Name": "Tuesday",
            "DayOfWeek": 2,
            "IsDefault": true
          },
          {
            "Name": "Wednesday",
            "DayOfWeek": 3,
            "IsDefault": true
          },
          {
            "Name": "Thursday",
            "DayOfWeek": 4,
            "IsDefault": true
          },
          {
            "Name": "Friday",
            "DayOfWeek": 5,
            "IsDefault": true
          },
          {
            "Name": "Saturday",
            "DayOfWeek": 6,
            "IsDefault": false
          },
          {
            "Name": "Sunday",
            "DayOfWeek": 0,
            "IsDefault": false
          }
        ]
      },
      "CategoryIdentities": listOfLocations
      
    }

    return payload
    

    
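def chunk_list(items, size):
    # Split items into consecutive sublists of at most `size` elements.
    return [items[i:i+size] for i in range(0, len(items), size)]

# Query the events endpoint 20 locations at a time.
locationChunks = chunk_list(locationList, 20)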


locationBooked = [] 
for count, listOfLocations in enumerate(locationChunks, start=1):
    print('%s of %s' %(count, len(locationChunks)))
    payload = update_payload(listOfLocations)

    url = 'https://opentimetable.dcu.ie/broker/api/categoryTypes/1e042cb1-547d-41d4-ae93-a1f2c3d34538/categories/events/filter'

    response = s.post(url, headers=headers, json=payload).json()
    
    for each in response:
        if len(each['CategoryEvents']) > 0:
            locationBooked += each['CategoryEvents']
            
            
            
df = pd.DataFrame(locationBooked)

Output: the first 5 of 2936 rows

print(df.head().to_string())
                          EventIdentity                HostKey                                                Description                EndDateTime  EventType  IsPublished   Location                                 Owner              StartDateTime  IsDeleted                       LastModified                                                                                                                                                                                                                                                                                                                                                  ExtraProperties  UserManuallyAddedEvent                        StatusIdentity Status StatusBackgroundColor                                                 Name                              Identity
0  026bce78-1cce-9354-43b2-720b96ba9e03       2122#SPLUSCD8040                                                       None  2022-02-10T19:00:00+00:00    Booking         True  AHC.ODG01  b8cf1f5a-9687-4440-86b8-13da2c69fa62  2022-02-10T17:00:00+00:00      False  2022-01-17T17:27:51.9494682+00:00                                                                                                                                                                                                                                                   [{'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '18-26', 'Rank': 3}]                   False  b48c85d4-19aa-4b19-87a6-63a5c6d2e630   None                  None                  Booking - AFU 21-22 (Grainne Reddy)  43c03d98-1c80-4ab5-a47a-19db340ab179
1  5dfe3e92-439f-d7b2-1aaf-384328680d90       2122#SPLUS42C975                                 Contemporary Irish Society  2022-02-10T13:00:00+00:00    Booking         True  AHC.ODG01  b8cf1f5a-9687-4440-86b8-13da2c69fa62  2022-02-10T10:30:00+00:00      False  2022-01-12T14:31:15.0526265+00:00                                                                                                                                                                                                                                                      [{'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '21', 'Rank': 3}]                   False  b48c85d4-19aa-4b19-87a6-63a5c6d2e630   None                  None  Booking - Boston University 21-22 (Sean Harrington)  d259d6bd-6849-42a4-b28c-d6849b2623c1
2  bd60b7f0-633e-b90a-79e6-29fceb4c2ea5  2122ED1009[2]OC/L5/01                                                    RE Cert  2022-02-10T16:00:00+00:00  On Campus         True  AHC.ODG01  b8cf1f5a-9687-4440-86b8-13da2c69fa62  2022-02-10T15:00:00+00:00      False  2021-11-03T10:03:42.4098645+00:00                 [{'Name': 'Module Name', 'DisplayName': 'Module Name', 'Value': 'ED1009[0] Religions, Ethics & Moral Values (CIC)', 'Rank': 1}, {'Name': 'Staff Member', 'DisplayName': 'Staff Member', 'Value': 'Wilkinson J', 'Rank': 2}, {'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '17-22, 24-26', 'Rank': 3}]                   False  b48c85d4-19aa-4b19-87a6-63a5c6d2e630   None                  None                                    ED1009[2]OC/L5/01  9589b2ef-17d7-4372-8870-5749c3ae6c37
3  bc161577-4c45-0ce9-a9c0-25fc942b12c0       2122#SPLUS57140A  Teacher as a Reflective Practitioner (School Placement)**  2022-02-09T10:00:00+00:00  On Campus         True  AHC.ODG01  b8cf1f5a-9687-4440-86b8-13da2c69fa62  2022-02-09T09:00:00+00:00      False  2021-10-28T10:35:59.4486955+00:00            [{'Name': 'Module Name', 'DisplayName': 'Module Name', 'Value': 'ED1024[0] Teacher as a Reflective Practitioner (SP)', 'Rank': 1}, {'Name': 'Staff Member', 'DisplayName': 'Staff Member', 'Value': 'Lodge A', 'Rank': 2}, {'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '2-11, 17-22, 24-26', 'Rank': 3}]                   False  b48c85d4-19aa-4b19-87a6-63a5c6d2e630   None                  None                                    ED1024[0]OC/T1/07  d213893f-3e0a-4a01-a0d9-7dac1d873627
4  a804c1a7-ef87-d8f0-16be-502d396699d6     2122ED2009[2]L1/01      Religions, Ethics, Morals and Values Education (REMV)  2022-02-11T15:00:00+00:00  On Campus         True  AHC.ODG01  b8cf1f5a-9687-4440-86b8-13da2c69fa62  2022-02-11T14:00:00+00:00      False  2021-09-21T16:50:21.8417239+00:00  [{'Name': 'Module Name', 'DisplayName': 'Module Name', 'Value': 'ED2009[0] Religious, Ethics, Morals & Values Education', 'Rank': 1}, {'Name': 'Staff Member', 'DisplayName': 'Staff Member', 'Value': 'Lodge A, Wilkinson J', 'Rank': 2}, {'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '17-22, 24-25', 'Rank': 3}]                   False  b48c85d4-19aa-4b19-87a6-63a5c6d2e630   None                  None                                    ED2009[2]OC/L1/01  6facf5d3-6cb3-45f3-a9c1-3311a74691e7
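The sort-and-find-gaps idea mentioned earlier could be sketched roughly as follows. This is only one way to do it, under a few assumptions: each booking is a dict with `Location`, `StartDateTime` and `EndDateTime` keys as in the output above, the bookable day runs from 08:00 to 22:00 (taken from the `TimePeriods` entry in the payload), and the function name `find_free_slots` is made up for illustration. The sample uses only the three AHC.ODG01 bookings on 2022-02-10 visible in the rows above, so the real free slots for that room may differ once all 2936 rows are included.

```python
from collections import defaultdict
from datetime import datetime, time


def find_free_slots(events, day_start=time(8, 0), day_end=time(22, 0)):
    """Return {(location, date): [(free_start, free_end), ...]}.

    Only location/date pairs with at least one booking appear in the result.
    """
    bookings = defaultdict(list)
    for ev in events:
        start = datetime.fromisoformat(ev["StartDateTime"])
        end = datetime.fromisoformat(ev["EndDateTime"])
        bookings[(ev["Location"], start.date())].append((start, end))

    free = {}
    for (location, day), slots in bookings.items():
        slots.sort()  # order by start time
        tz = slots[0][0].tzinfo
        cursor = datetime.combine(day, day_start, tzinfo=tz)
        closing = datetime.combine(day, day_end, tzinfo=tz)
        gaps = []
        for start, end in slots:
            if start > cursor:
                gaps.append((cursor, start))  # nothing booked in between
            cursor = max(cursor, end)  # also handles overlapping bookings
        if cursor < closing:
            gaps.append((cursor, closing))
        free[(location, day)] = gaps
    return free


# Three bookings copied from the output rows above (same room, same day).
sample = [
    {"Location": "AHC.ODG01",
     "StartDateTime": "2022-02-10T17:00:00+00:00",
     "EndDateTime": "2022-02-10T19:00:00+00:00"},
    {"Location": "AHC.ODG01",
     "StartDateTime": "2022-02-10T10:30:00+00:00",
     "EndDateTime": "2022-02-10T13:00:00+00:00"},
    {"Location": "AHC.ODG01",
     "StartDateTime": "2022-02-10T15:00:00+00:00",
     "EndDateTime": "2022-02-10T16:00:00+00:00"},
]

free = find_free_slots(sample)
summary = {
    (location, day.isoformat()): [f"{a:%H:%M}-{b:%H:%M}" for a, b in gaps]
    for (location, day), gaps in free.items()
}
print(summary)
# {('AHC.ODG01', '2022-02-10'): ['08:00-10:30', '13:00-15:00', '16:00-17:00', '19:00-22:00']}
```

With the scraped data, `find_free_slots(locationBooked)` should work the same way, since `locationBooked` is a list of dicts carrying those three keys. Rooms with no bookings at all on a given day will not show up, so they would need to be added separately from `locationList`.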