Web Scraping Header issue

Question

I am trying to scrape data from a website as an educational exercise. I am using Python and Beautiful Soup.

I am basically looking at the products on a page, for example http://www.asos.com/Women/Dresses/Cat/pgecategory.aspx?cid=8799#parentID=-1&pge=0&pgeSize=5&sort=-1

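I noticed that it has the parameters pge and pgeSize, which I can change in the browser and get the results I expect, but when I run it with Python requests it always returns the same 36 products (36 is the default).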

I figured this was a header issue, so I tried using curl and the Chrome developer tools to work out which headers I need, but with curl I cannot get past the following response:

curl -c ~/cookie -H "Accept: application/xml" -H "Accept-Language: en-GB,en-US;q=0.8,en;q=0.6" -H "Content-Type: application/xml" -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36" -X GET 'http://www.asos.com/Women/Dresses/Cat/pgecategory.aspx?cid=8799#parentID=-1&pge=0&pgeSize=5&sort=-1'

Object moved

Object moved to here.

What is the right way to debug and try to solve this?

Answer 1

You need to supply the asos cookie, for example with this curl flag:

curl --cookie "asos=currencyid=19" 'http://www.asos.com/Women/Dresses/Cat/pgecategory.aspx?cid=8799#parentID=-1&pge=0&pgeSize=5&sort=-1'
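For the Python side of the question, the same cookie can be attached to the request as a header. The following is only a minimal sketch using the standard library's urllib.request rather than the requests package; the helper names build_request and fetch are made up for illustration, and the cookie value is simply the one from the curl command above. The fragment is left off the URL because HTTP clients do not send it to the server.

```python
import urllib.request

# Category URL from the question, without the "#..." fragment.
URL = "http://www.asos.com/Women/Dresses/Cat/pgecategory.aspx?cid=8799"


def build_request(url, currency_id="19"):
    """Build a GET request carrying the asos cookie (hypothetical helper)."""
    # Same cookie string that curl sends with --cookie "asos=currencyid=19".
    cookie = "asos=currencyid=" + currency_id
    return urllib.request.Request(url, headers={"Cookie": cookie})


def fetch(url):
    """Send the request and return the decoded body (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


request = build_request(URL)
print(request.get_header("Cookie"))  # asos=currencyid=19
```

Calling fetch(URL) would actually send the request; whether the site still accepts this cookie is not something the sketch can guarantee.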

Answer 2

The URL /Women/Dresses/Cat/pgecategory.aspx?cid=8799&r=2 always returns the default dresses.

Note that parentID=-1&pge=7&pgeSize=5&sort=-1 comes after the # symbol, so it is a URL fragment, which is never sent to the server.

The page then makes an additional query that fetches the right dresses and swaps them in for you.
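To see which parameters the server actually receives, the URL from the question can be split into its query and fragment parts. This is a small illustrative sketch using urllib.parse from the standard library; the variable names server_params and client_params are chosen here for clarity and do not come from the site.

```python
from urllib.parse import urlsplit, parse_qs

# URL from the question, including the "#..." fragment.
URL = ("http://www.asos.com/Women/Dresses/Cat/pgecategory.aspx"
       "?cid=8799#parentID=-1&pge=0&pgeSize=5&sort=-1")

parts = urlsplit(URL)

# Everything between "?" and "#" is the query string, sent to the server.
server_params = parse_qs(parts.query)

# Everything after "#" is the fragment, which stays in the client.
client_params = parse_qs(parts.fragment)

print(server_params)  # {'cid': ['8799']}
print(client_params)  # {'parentID': ['-1'], 'pge': ['0'], 'pgeSize': ['5'], 'sort': ['-1']}
```

Only cid reaches the server, which is consistent with curl and Python always getting the default 36 products regardless of pge and pgeSize.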
