Data Scraping from pogdesign.co.uk/cat/
I am trying to scrape some data from http://www.pogdesign.co.uk/cat/.
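I want to get the channel and air time of each show, but they are not displayed by default. They only appear after the settings have been configured manually and saved.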
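As far as I can tell from the 'Network' tab in Chrome's developer tools, clicking 'Save Settings' sends a POST request with the relevant form parameters (e.g. 's_networks':'on', etc.), followed by a GET request that retrieves the HTML page containing the channels and air times of the shows.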
I tried to simulate this process (a POST request followed by a GET request) with both Python's requests package and the mechanicalsoup package.
requests:
import requests
s = requests.Session()
s.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
s.get('http://www.pogdesign.co.uk/cat/')
mechanicalsoup:
import mechanicalsoup
mcs = mechanicalsoup.Browser()
res_post = mcs.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
res_get = mcs.get('http://www.pogdesign.co.uk/cat/')
But the response I receive does not contain the channel and air time data.
The only difference I noticed is that the browser's POST request returns status code 302, while my Python request returns status code 200.
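A note on the 302 versus 200 difference, offered as a likely explanation rather than a confirmed diagnosis: requests follows redirects by default, so the status code on the returned response belongs to the final page, and any intermediate 302 is kept in `response.history`. The sketch below shows one way to inspect that chain. `status_chain` and `post_settings` are hypothetical helper names introduced here for illustration, and the payload is the one from the question.

```python
import requests

URL = "http://www.pogdesign.co.uk/cat/"


def status_chain(response):
    """Return the status code of every hop: redirects first, final response last."""
    return [r.status_code for r in response.history] + [response.status_code]


def post_settings(session, payload, follow=True):
    """POST the settings form. With follow=False the redirect is not followed,
    so a 302 (if the server sends one) is returned as-is."""
    return session.post(URL, data=payload, allow_redirects=follow)


if __name__ == "__main__":
    with requests.Session() as s:
        # Default behaviour: redirects are followed and recorded in .history.
        resp = post_settings(s, {"s_networks": "on"})
        print(status_chain(resp))

        # Without following redirects, the raw status and target are visible.
        raw = post_settings(s, {"s_networks": "on"}, follow=False)
        print(raw.status_code, raw.headers.get("Location"))
```

If the chain does not include a 302, the server is probably treating the scripted POST differently from the browser's, which would point at missing cookies or form fields rather than at redirect handling.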
Because the cookies store the user information, you can try the code below:
import requests

s = requests.Session()

# Form fields sent when the settings are saved.
data = {
    "style": 3,
    "timezone": "GMT",
    "s_numbers": "on",
    "s_epnames": "on",
    "s_airtimes": "on",
    "s_popups": "on",
    "s_wunwatched": "on",
    "s_sortbyname": "on",
    "s_weekstyle": "on",
    "s_24hr": "on",
    # Note: requests omits keys whose value is None, so this field is not sent.
    "settings": None,
}

# You can get the cookie values from the browser's dev tools.
cookies = {
    "CAT_UID": "",
    "PHPSESSID": "",
    "_ga": "",
    "_gid": "",
    "_gat": "",
}

# Save the settings, then fetch the calendar page again with the same cookies.
post = s.post('http://www.pogdesign.co.uk/cat/', data=data, cookies=cookies)
text = post.text
get = s.get('http://www.pogdesign.co.uk/cat/', cookies=cookies)
text1 = get.text
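As an alternative to copying cookie values out of the dev tools, a `requests.Session` can collect the server's cookies itself if the page is visited once before the settings are posted. The following is a minimal sketch under assumptions, not a verified solution: whether the site accepts this flow has not been tested, `build_payload` and `fetch_calendar` are hypothetical helper names, and sending the submit button as an empty `settings` field is a guess. The checkbox names come from the question and the answer above.

```python
import requests

URL = "http://www.pogdesign.co.uk/cat/"

# Checkbox names seen in the question and in the answer's form data.
CHECKBOXES = ["s_networks", "s_airtimes", "s_numbers", "s_epnames", "s_24hr"]


def build_payload(checkboxes, style=3, timezone="GMT"):
    """Build the settings form data; each ticked checkbox is sent as 'on'."""
    payload = {"style": style, "timezone": timezone}
    payload.update({name: "on" for name in checkboxes})
    # Assumption: the submit button is sent as an empty 'settings' field.
    # An empty string is used because requests drops None-valued keys.
    payload["settings"] = ""
    return payload


def fetch_calendar(session):
    """GET once so the server can set its cookies, POST the settings, GET again."""
    session.get(URL)
    session.post(URL, data=build_payload(CHECKBOXES))
    return session.get(URL).text


if __name__ == "__main__":
    with requests.Session() as s:
        html = fetch_calendar(s)
        print(len(html))
```

If the returned HTML still lacks the channels and air times, comparing `s.cookies` with the cookies shown in the browser is a reasonable next step.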