Scrape links from many Google searches in Python

Question

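I want to scrape the first link that appears in a Google search, repeat this for 23,000 searches, and append the results to the DataFrame I am working with. This is the error I get: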

Traceback (most recent call last):
  File "file.py", line 26, in <module>
    website = showsome(company)
  File "file.py", line 18, in showsome
    hits = data['results']
TypeError: 'NoneType' object has no attribute '__getitem__'

This is my code so far:

import json
import urllib
import pandas as pd

def showsome(searchfor):
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    hits = data['results']
    d = hits[0]['visibleUrl']
    return d

company_names = pd.read_csv("my_file.csv")

websites = []
for company in company_names["Company"]:
    website = showsome(company)
    websites.append(website)
websites = pd.DataFrame(websites, columns=["Website"])

result = pd.concat([company_names,websites], axis=1, join='inner')
result.to_csv("export_file.csv", index=False, encoding="utf-8")

(I changed the names of the input and output files for privacy reasons.)

Thanks!

Answer 1

I'll try to explain why this exception occurs.

It looks like Google has detected you and sends back a well-formed error response, namely:

{u'responseData': None, u'responseDetails': u'Suspected Terms of Service Abuse. Please see http://code.google.com/apis/errors', u'responseStatus': 403}

That response is then assigned to results by the following expression:

results = json.loads(search_results)

So data = results['responseData'] is None, and when you then run hits = data['results'], the lookup data['results'] raises the TypeError: data is None, and None cannot be indexed with a key (it has no __getitem__).
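One way to avoid the crash is to check for the None case before indexing. The following is only a minimal sketch: first_visible_url is a hypothetical helper name, and the shape of the successful response (responseData -> results -> visibleUrl) is assumed from the code in the question, not from any API documentation. It works on the already-parsed dictionary, so it makes no network calls.

```python
def first_visible_url(results):
    """Return the first hit's visibleUrl, or None if there is nothing usable.

    `results` is the dict produced by json.loads(search_results).
    (Hypothetical helper; the key names are taken from the question's code.)
    """
    data = results.get('responseData')
    if not data:
        # Blocked or failed request: responseData is None (e.g. status 403).
        return None
    hits = data.get('results') or []
    if not hits:
        # The request succeeded but returned no hits.
        return None
    return hits[0].get('visibleUrl')


# The error response quoted above.
blocked = {
    'responseData': None,
    'responseDetails': 'Suspected Terms of Service Abuse. '
                       'Please see http://code.google.com/apis/errors',
    'responseStatus': 403,
}
# An assumed, simplified successful response for illustration only.
ok = {'responseData': {'results': [{'visibleUrl': 'www.aam.com'}]},
      'responseStatus': 200}

print(first_visible_url(blocked))
print(first_visible_url(ok))
```

With a guard like this, a blocked request yields None for that company instead of stopping the whole run.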

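I tried to look more like a real user by adding some random waits with the random module (just a simple attempt). However, I strongly advise against doing this if you do not have permission from Google. For what it's worth, I used time.sleep(random.choice((1,3,3,2,4,1,0))) as follows.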

# Note: this is Python 2 code (urllib.urlencode, urllib.urlopen, print statement).
import json
import random
import time
import urllib
import pandas as pd

def showsome(searchfor):
    # Build the query string and fetch the JSON search response.
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    # responseData is None when Google rejects the request.
    data = results['responseData']
    hits = data['results']
    # Return the visible URL of the first hit.
    d = hits[0]['visibleUrl']
    return d

company_names = pd.read_csv("my_file.csv")

websites = []
for company in company_names["Company"]:
    website = showsome(company)
    websites.append(website)
    # Wait a random number of seconds (0 to 4) between requests.
    time.sleep(random.choice((1,3,3,2,4,1,0)))
    print website
websites = pd.DataFrame(websites, columns=["Website"])

# Put the websites next to the company names and write the result.
result = pd.concat([company_names,websites], axis=1, join='inner')
result.to_csv("export_file.csv", index=False, encoding="utf-8")

It produces a CSV containing:

Company,Website
American Axle,www.aam.com
American Broadcasting Company,en.wikipedia.org
American Eagle Outfitters,ae.com
American Electric Power,www.aep.com
American Express,www.americanexpress.com
American Family Insurance,www.amfam.com
American Financial Group,www.afginc.com
American Greetings,www.americangreetings.com
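With 23,000 searches, a single rejected request would still abort the loop. Below is a rough sketch of how the waiting loop could record None for a failed lookup and carry on. Everything named here is my own assumption rather than part of the original code: lookup_all, its min_wait and max_wait parameters, and the fake_lookup stand-in used so the example runs without touching the network. The same caveat about needing permission applies.

```python
import random
import time


def lookup_all(names, lookup, min_wait=1.0, max_wait=4.0, sleep=time.sleep):
    """Call lookup(name) for every name, pausing a random time between calls.

    A failed lookup is stored as None, so the returned list always has
    exactly one entry per input name. (Hypothetical helper.)
    """
    websites = []
    for name in names:
        try:
            websites.append(lookup(name))
        except Exception:
            # Deliberately broad for a sketch: this covers e.g. the TypeError
            # raised when responseData is None.
            websites.append(None)
        # Random pause between min_wait and max_wait seconds.
        sleep(random.uniform(min_wait, max_wait))
    return websites


def fake_lookup(name):
    """Stand-in for showsome(); fails for one name to simulate a block."""
    if name == 'Blocked Co':
        raise TypeError("'NoneType' object is not subscriptable")
    return name.lower().replace(' ', '') + '.example'


# The sleep is replaced with a no-op so the demonstration finishes instantly.
demo = lookup_all(['American Axle', 'Blocked Co'], fake_lookup,
                  sleep=lambda seconds: None)
print(demo)
```

Because the list keeps one entry per company, building the Website column from it still lines up row by row with company_names in the pd.concat step. In real use you would pass showsome as lookup and leave sleep at its default.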
