Bypassing EU consent request

Question

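I have been trying to scrape data from Google Search, but I can't get past the "Before you continue to Google Search" consent form.

I looked for workarounds and saw others suggest sending CONSENT=PENDING+999, or something like CONSENT = YES+HU.hu+V10+B+256, as a parameter in the GET request. Unfortunately, I couldn't get the former to work, and for the latter I'm not entirely sure what the last three elements should be replaced with.

I'm attaching the example code (taken from here) below.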

import requests
import bs4

headers = {
    'User-Agent': 'Chrome 83 (Toshiba; Intel(R) Core(TM) i3-2367M CPU @ 1.40 GHz)'
                  'Windows 7 Home Premium',
    'Accept': 'text/html,application/xhtml+xml,application/xml;'
              'q=0.9,image/webp,*/*;q=0.8',
    # 'cookie': 'CONSENT = YES+HU.hu+V10+B+256'  # what are the last three elements?
    'cookie': 'CONSENT=PENDING+999',
}

text = "geeksforgeeks"
url = 'https://google.com/search?q=' + text

request_result = requests.get(url, headers=headers)  # here's where the trouble happens

soup = bs4.BeautifulSoup(request_result.text, "html.parser")

print(soup)  # not what one would expect

heading_object = soup.find_all('h3')

for info in heading_object:
    print(info.getText())
    print("------")

Any help would be appreciated.

Answer 1

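Yes, Google does use the CONSENT cookie to decide whether to show the consent popup. I played around with the cookie by tweaking its value, and I can conclude that setting the CONSENT cookie value to YES+ is enough to prevent the consent window from showing.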
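If you want to check from code whether you got the consent form instead of search results, something like the following might help. This is only a rough sketch: the host name consent.google.com and the helper name looks_like_consent_page are my assumptions, and the marker text is the English wording of the form quoted in the question, so it will not match other languages.

```python
from urllib.parse import urlsplit

# Assumption: the consent form is served from this host after a redirect.
CONSENT_HOST = "consent.google.com"
# Assumption: English wording of the form, as quoted in the question.
CONSENT_MARKER = "Before you continue"


def looks_like_consent_page(final_url: str, html: str) -> bool:
    """Return True if the final URL or the page body suggests the consent form."""
    host = urlsplit(final_url).hostname or ""
    return host == CONSENT_HOST or CONSENT_MARKER in html
```

With requests, you would pass request_result.url and request_result.text to this helper after the call returns.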

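In your code, you are trying to pass the cookie through the headers parameter. I suggest using the cookies parameter instead.

Adjust your code with this (and remove the cookie from headers):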

request_result = requests.get(url, headers=headers, cookies={'CONSENT': 'YES+'})

My output after running your code with my solution:

GeeksforGeeks
------
GeeksforGeeks - YouTube
------
GeeksforGeeks | LinkedIn
------
GeeksforGeeks (@geeks_for_geeks) • Instagram photos and videos
------
GeeksforGeeks - Twitter
------
GeeksforGeeks - Home | Facebook
------
Geeks for Geeks - Crunchbase Company Profile & Funding
------
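
To see what the cookies parameter actually puts on the wire, you can prepare the request without sending it and inspect the resulting Cookie header. This is a minimal sketch, not a definitive implementation: the function name build_search_request and the shortened User-Agent string are placeholders of mine, and the CONSENT value is the one from this answer.

```python
import requests

# Placeholder headers. No 'cookie' entry here, because the cookie is
# passed through the `cookies` argument instead.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # hypothetical value
    "Accept": "text/html,application/xhtml+xml,application/xml;"
              "q=0.9,image/webp,*/*;q=0.8",
}


def build_search_request(query: str) -> requests.PreparedRequest:
    """Prepare (but do not send) a search request carrying the CONSENT cookie."""
    request = requests.Request(
        "GET",
        "https://google.com/search",
        params={"q": query},          # requests URL-encodes the query for us
        headers=HEADERS,
        cookies={"CONSENT": "YES+"},  # value suggested in this answer
    )
    return request.prepare()


prepared = build_search_request("geeksforgeeks")
print(prepared.url)                # final URL including the query string
print(prepared.headers["Cookie"])  # Cookie header built from `cookies=`
```

Nothing is sent until you pass the prepared request to a session, for example with requests.Session().send(prepared).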
