Data scraping from pogdesign.co.uk/cat/

I am trying to scrape some data from http://www.pogdesign.co.uk/cat/.

I want to get the channel and airtime of each show, but by default they are not displayed. They appear only after the settings are configured manually and saved.

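As far as I can tell from the 'Network' tab of Chrome's developer tools, clicking 'Save Settings' sends a POST request carrying the relevant form parameters (for example 's_networks': 'on'), followed by a GET request that retrieves the HTML page with the shows' channels and airtimes.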

I tried to simulate this process (a POST request followed by a GET request) with both Python's requests package and the mechanicalsoup package.

requests:

import requests

s = requests.Session()
s.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
s.get('http://www.pogdesign.co.uk/cat/')

mechanicalsoup:

import mechanicalsoup

mcs = mechanicalsoup.Browser()
res_post = mcs.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
res_get = mcs.get('http://www.pogdesign.co.uk/cat/')

But the responses I receive do not contain the channel and airtime data.

The only difference I noticed is that the browser's POST request returns status code 302, while my Python request returns status code 200.

Because the cookies store the user's information, you can try the following code:

import requests

s = requests.Session()
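# Form fields sent by the 'Save Settings' request.
data = {
    "style": 3,
    "timezone": "GMT",
    "s_numbers": "on",
    "s_epnames": "on",
    "s_airtimes": "on",
    "s_popups": "on",
    "s_wunwatched": "on",
    "s_sortbyname": "on",
    "s_weekstyle": "on",
    "s_24hr": "on",
    # Note: requests omits fields whose value is None; use "" instead
    # if the form must send an empty 'settings' field.
    "settings": None
}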
cookies = {  # copy these values from the browser's developer tools
    "CAT_UID": "",
    "PHPSESSID": "",
    "_ga": "",
    "_gid": "",
    "_gat": ""
}
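# Submit the settings form together with the browser's cookies.
post = s.post('http://www.pogdesign.co.uk/cat/', data=data, cookies=cookies)
text = post.text
# Fetch the calendar page again with the same cookies.
get = s.get('http://www.pogdesign.co.uk/cat/', cookies=cookies)
text1 = get.text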
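Rather than typing each cookie by hand, the whole Cookie request header can be copied from the developer tools and parsed into a dict. A minimal sketch using the standard library, where `cookies_from_header` is a hypothetical helper name and the values shown are placeholders, not real credentials:

```python
from http.cookies import SimpleCookie


def cookies_from_header(header):
    """Turn a raw 'Cookie:' header value into a plain name -> value dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder header; paste the real one from the browser's Network tab.
example = cookies_from_header("CAT_UID=abc123; PHPSESSID=def456")
```

The resulting dict can be passed as `cookies=example` to `s.post` and `s.get` in place of the hand-written one.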