Web-scraping a college timetable keeps giving me a 'charmap' error
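I am trying to web-scrape a timetable for a project and find the free slots on it. When I run this code, I get the following error: "'charmap' codec can't decode byte 0x8d in position 32078: character maps to <undefined>"
Here is my code: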
import requests

url = "https://opentimetable.dcu.ie/"
response = requests.get(url)
with open('webpage.html', 'r') as html_file:
    content = html_file.read()
Any help is appreciated.
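On the 'charmap' error itself: it usually means the file was opened with the Windows default code page (cp1252), in which byte 0x8d has no mapping. A minimal sketch, assuming the saved page is UTF-8; the sample HTML string and the temporary path are made up for illustration. Note that the snippet above never writes `response` to `webpage.html`, so the file being read was saved some other way.

```python
import os
import tempfile

# "č" (U+010D) is encoded in UTF-8 as the bytes 0xC4 0x8D; 0x8D is undefined in cp1252.
sample_html = "<html><body>Učebna 101</body></html>"  # hypothetical page content

path = os.path.join(tempfile.mkdtemp(), "webpage.html")

# Save the page with an explicit encoding (e.g. when writing response.text).
with open(path, "w", encoding="utf-8") as html_file:
    html_file.write(sample_html)

# Reading it back with cp1252 reproduces the error from the question.
try:
    with open(path, "r", encoding="cp1252") as html_file:
        html_file.read()
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# Reading it back with the encoding it was written in succeeds.
with open(path, "r", encoding="utf-8") as html_file:
    content = html_file.read()

print(cp1252_error)
print(content == sample_html)
```

Passing `encoding="utf-8"` to `open()` on both the write and the read side avoids depending on the platform default.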
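The reason you are not getting anything is that this data is rendered dynamically. You need to select different parameters to query what you are looking for, and it won't show up in a simple static request.
So there is an API that lets you search by different categories. The category with the fewest unique values is "Location", so I went with that.
This gets all the location IDs, then feeds them into the filter to find what is booked at each location and at what times.
You have the start and end times (and dates) of the bookings in the table. I'll leave it to you to parse that information to find the dates or times that are open. What I would do is use Python to create a list of all the unique dates with their start and end times, sort it, and then find where there are gaps/no overlaps.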
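The sort-and-find-gaps idea could look roughly like this. It is only a sketch: `free_slots` is a hypothetical helper, not part of the API, and the 08:00 to 22:00 window is an assumption taken from the "All Day" time period in the request payload. The three bookings are the 2022-02-10 rows for AHC.ODG01 in the sample output.

```python
from datetime import datetime


def free_slots(bookings, day_start, day_end):
    """Return (start, end) gaps not covered by any booking within the day window."""
    gaps = []
    cursor = day_start  # end of the latest booking seen so far
    for start, end in sorted(bookings):
        if start > cursor:
            gaps.append((cursor, min(start, day_end)))
        cursor = max(cursor, end)  # overlapping bookings just extend the cursor
        if cursor >= day_end:
            break
    if cursor < day_end:
        gaps.append((cursor, day_end))
    return gaps


raw = [
    ("2022-02-10T17:00:00+00:00", "2022-02-10T19:00:00+00:00"),
    ("2022-02-10T10:30:00+00:00", "2022-02-10T13:00:00+00:00"),
    ("2022-02-10T15:00:00+00:00", "2022-02-10T16:00:00+00:00"),
]
bookings = [(datetime.fromisoformat(s), datetime.fromisoformat(e)) for s, e in raw]
day_start = datetime.fromisoformat("2022-02-10T08:00:00+00:00")
day_end = datetime.fromisoformat("2022-02-10T22:00:00+00:00")

slots = free_slots(bookings, day_start, day_end)
for start, end in slots:
    print(start.strftime("%H:%M"), "-", end.strftime("%H:%M"))
```

Tracking a single running "cursor" means overlapping or back-to-back bookings need no separate merge step.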
import requests
import pandas as pd
s = requests.Session()
url = "https://opentimetable.dcu.ie/broker/api/categoryTypeOptions"
s.get(url)
cookies = s.cookies.get_dict()
cookieStr = ''
for k, v in cookies.items():
    cookieStr += f'{k}={v};'
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,en-GB;q=0.8',
    'Authorization': 'basic T64Mdy7m[',
    'Connection': 'keep-alive',
    'Content-Type': 'application/json',
    'Cookie': cookieStr,
    'Host': 'opentimetable.dcu.ie',
    'Referer': 'https://opentimetable.dcu.ie/',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url = 'https://opentimetable.dcu.ie/broker/api/CategoryTypes/1e042cb1-547d-41d4-ae93-a1f2c3d34538/Categories/Filter?pageNumber=1'
payload = '[{"Identity": "6359fd0c-1bbe-496a-8998-4fefc5cd18de","Values": ["null"]}]'
jsonData = s.post(url, headers=headers, data=payload).json()
totalPages = jsonData['TotalPages']
print('Page: 1 of %s' %totalPages)
locationList = jsonData['Results']
for page in range(2, totalPages+1):
    print('Page: %s of %s' %(page,totalPages))
    url = 'https://opentimetable.dcu.ie/broker/api/CategoryTypes/1e042cb1-547d-41d4-ae93-a1f2c3d34538/Categories/Filter?pageNumber=%s' %(page)
    jsonData = s.post(url, headers=headers, data=payload).json()
    locationList += jsonData['Results']
locationList = [x['Identity'] for x in locationList]
def update_payload(listOfLocations):
    # Flags are passed as the strings "true"/"false"
    true = "true"
    false = "false"
    payload = {
        "ViewOptions": {
            "Days": [
                {"Name": "Monday", "DayOfWeek": 1, "IsDefault": true},
                {"Name": "Tuesday", "DayOfWeek": 2, "IsDefault": true},
                {"Name": "Wednesday", "DayOfWeek": 3, "IsDefault": true},
                {"Name": "Thursday", "DayOfWeek": 4, "IsDefault": true},
                {"Name": "Friday", "DayOfWeek": 5, "IsDefault": true}
            ],
            "Weeks": [
                {
                    "WeekNumber": 21,
                    "WeekLabel": "21",
                    "FirstDayInWeek": "2022-02-07T00:00:00.000Z"
                }
            ],
            "TimePeriods": [
                {
                    "Description": "All Day",
                    "StartTime": "08:00",
                    "EndTime": "22:00",
                    "IsDefault": true
                }
            ],
            "DatePeriods": [
                {
                    "Description": "This Week",
                    "StartDateTime": "2021-09-20T00:00:00.000Z",
                    "EndDateTime": "2022-09-20T00:00:00.000Z",
                    "IsDefault": true,
                    "IsThisWeek": true,
                    "IsNextWeek": false,
                    "Type": "ThisWeek"
                }
            ],
            "LegendItems": [],
            "InstitutionConfig": {},
            "DateConfig": {
                "FirstDayInWeek": 1,
                "StartDate": "2021-09-20T00:00:00+00:00",
                "EndDate": "2022-09-20T00:00:00+00:00"
            },
            "AllDays": [
                {"Name": "Monday", "DayOfWeek": 1, "IsDefault": true},
                {"Name": "Tuesday", "DayOfWeek": 2, "IsDefault": true},
                {"Name": "Wednesday", "DayOfWeek": 3, "IsDefault": true},
                {"Name": "Thursday", "DayOfWeek": 4, "IsDefault": true},
                {"Name": "Friday", "DayOfWeek": 5, "IsDefault": true},
                {"Name": "Saturday", "DayOfWeek": 6, "IsDefault": false},
                {"Name": "Sunday", "DayOfWeek": 0, "IsDefault": false}
            ]
        },
        "CategoryIdentities": listOfLocations
    }
    return payload
x = 20
final_list = lambda test_list, x: [test_list[i:i+x] for i in range(0, len(test_list), x)]
locationChunks = final_list(locationList, x)
locationBooked = []
for count, listOfLocations in enumerate(locationChunks, start=1):
    print('%s of %s' %(count, len(locationChunks)))
    payload = update_payload(listOfLocations)
    url = 'https://opentimetable.dcu.ie/broker/api/categoryTypes/1e042cb1-547d-41d4-ae93-a1f2c3d34538/categories/events/filter'
    response = s.post(url, headers=headers, json=payload).json()
    for each in response:
        if len(each['CategoryEvents']) > 0:
            locationBooked += each['CategoryEvents']
df = pd.DataFrame(locationBooked)
Output: first 5 rows of 2936
print(df.head().to_string())
EventIdentity HostKey Description EndDateTime EventType IsPublished Location Owner StartDateTime IsDeleted LastModified ExtraProperties UserManuallyAddedEvent StatusIdentity Status StatusBackgroundColor Name Identity
0 026bce78-1cce-9354-43b2-720b96ba9e03 2122#SPLUSCD8040 None 2022-02-10T19:00:00+00:00 Booking True AHC.ODG01 b8cf1f5a-9687-4440-86b8-13da2c69fa62 2022-02-10T17:00:00+00:00 False 2022-01-17T17:27:51.9494682+00:00 [{'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '18-26', 'Rank': 3}] False b48c85d4-19aa-4b19-87a6-63a5c6d2e630 None None Booking - AFU 21-22 (Grainne Reddy) 43c03d98-1c80-4ab5-a47a-19db340ab179
1 5dfe3e92-439f-d7b2-1aaf-384328680d90 2122#SPLUS42C975 Contemporary Irish Society 2022-02-10T13:00:00+00:00 Booking True AHC.ODG01 b8cf1f5a-9687-4440-86b8-13da2c69fa62 2022-02-10T10:30:00+00:00 False 2022-01-12T14:31:15.0526265+00:00 [{'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '21', 'Rank': 3}] False b48c85d4-19aa-4b19-87a6-63a5c6d2e630 None None Booking - Boston University 21-22 (Sean Harrington) d259d6bd-6849-42a4-b28c-d6849b2623c1
2 bd60b7f0-633e-b90a-79e6-29fceb4c2ea5 2122ED1009[2]OC/L5/01 RE Cert 2022-02-10T16:00:00+00:00 On Campus True AHC.ODG01 b8cf1f5a-9687-4440-86b8-13da2c69fa62 2022-02-10T15:00:00+00:00 False 2021-11-03T10:03:42.4098645+00:00 [{'Name': 'Module Name', 'DisplayName': 'Module Name', 'Value': 'ED1009[0] Religions, Ethics & Moral Values (CIC)', 'Rank': 1}, {'Name': 'Staff Member', 'DisplayName': 'Staff Member', 'Value': 'Wilkinson J', 'Rank': 2}, {'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '17-22, 24-26', 'Rank': 3}] False b48c85d4-19aa-4b19-87a6-63a5c6d2e630 None None ED1009[2]OC/L5/01 9589b2ef-17d7-4372-8870-5749c3ae6c37
3 bc161577-4c45-0ce9-a9c0-25fc942b12c0 2122#SPLUS57140A Teacher as a Reflective Practitioner (School Placement)** 2022-02-09T10:00:00+00:00 On Campus True AHC.ODG01 b8cf1f5a-9687-4440-86b8-13da2c69fa62 2022-02-09T09:00:00+00:00 False 2021-10-28T10:35:59.4486955+00:00 [{'Name': 'Module Name', 'DisplayName': 'Module Name', 'Value': 'ED1024[0] Teacher as a Reflective Practitioner (SP)', 'Rank': 1}, {'Name': 'Staff Member', 'DisplayName': 'Staff Member', 'Value': 'Lodge A', 'Rank': 2}, {'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '2-11, 17-22, 24-26', 'Rank': 3}] False b48c85d4-19aa-4b19-87a6-63a5c6d2e630 None None ED1024[0]OC/T1/07 d213893f-3e0a-4a01-a0d9-7dac1d873627
4 a804c1a7-ef87-d8f0-16be-502d396699d6 2122ED2009[2]L1/01 Religions, Ethics, Morals and Values Education (REMV) 2022-02-11T15:00:00+00:00 On Campus True AHC.ODG01 b8cf1f5a-9687-4440-86b8-13da2c69fa62 2022-02-11T14:00:00+00:00 False 2021-09-21T16:50:21.8417239+00:00 [{'Name': 'Module Name', 'DisplayName': 'Module Name', 'Value': 'ED2009[0] Religious, Ethics, Morals & Values Education', 'Rank': 1}, {'Name': 'Staff Member', 'DisplayName': 'Staff Member', 'Value': 'Lodge A, Wilkinson J', 'Rank': 2}, {'Name': 'Activity.TeachingWeekPattern_PatternAsArray', 'DisplayName': 'Weeks', 'Value': '17-22, 24-25', 'Rank': 3}] False b48c85d4-19aa-4b19-87a6-63a5c6d2e630 None None ED2009[2]OC/L1/01 6facf5d3-6cb3-45f3-a9c1-3311a74691e7
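Since the result mixes every location and day, the bookings probably need to be grouped before looking for free time. A rough sketch, assuming each record has `Location`, `StartDateTime` and `EndDateTime` as plain strings, as they appear in the printed output. `group_by_location_and_day` is a hypothetical helper, and the three records are copied from the rows above.

```python
from collections import defaultdict
from datetime import datetime

# Trimmed-down records with only the three fields used here.
events = [
    {"Location": "AHC.ODG01", "StartDateTime": "2022-02-10T17:00:00+00:00", "EndDateTime": "2022-02-10T19:00:00+00:00"},
    {"Location": "AHC.ODG01", "StartDateTime": "2022-02-10T10:30:00+00:00", "EndDateTime": "2022-02-10T13:00:00+00:00"},
    {"Location": "AHC.ODG01", "StartDateTime": "2022-02-09T09:00:00+00:00", "EndDateTime": "2022-02-09T10:00:00+00:00"},
]


def group_by_location_and_day(records):
    """Map (location, date) to a sorted list of (start, end) datetimes."""
    grouped = defaultdict(list)
    for record in records:
        start = datetime.fromisoformat(record["StartDateTime"])
        end = datetime.fromisoformat(record["EndDateTime"])
        grouped[(record["Location"], start.date())].append((start, end))
    for intervals in grouped.values():
        intervals.sort()
    return dict(grouped)


by_room_day = group_by_location_and_day(events)
for (location, day), intervals in sorted(by_room_day.items()):
    print(location, day, [(s.strftime("%H:%M"), e.strftime("%H:%M")) for s, e in intervals])
```

With the full result, `locationBooked` (the list of dicts behind `df`) could be passed in place of `events`.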