Problem with searching multiple keywords using google custom search API
I am trying to search multiple keywords (stored in the list filteredList) and get a list of results for each search. This is the code I have tried:
from googleapiclient.discovery import build
import csv
import pprint

my_api_key = "xxx"
my_cse_id = "xxx"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res['items']

filteredList = ['Optimal Elektronika',
                'Evrascon',
                ]

words = [
    'vakansiya'
]

newDictList = []

# this is the htmlSnippets, link and also htmlTitle for filtering over the list of the dictionaries
keyValList = ['link', 'htmlTitle', 'htmlSnippet']

for word in filteredList:
    results = google_search(word, my_api_key, my_cse_id, num=5)
    # print(results)
    newDict = dict()
    for result in results:
        for (key, value) in result.items():
            if key in keyValList:
                if word in newDict['htmlSnippet']:
                    pass
                newDict[key] = pprint.pprint(value)
    newDictList.append(newDict)

print(newDictList)
The error I get when running the answer's script:
Traceback (most recent call last):
  File "/Users/valizadavali/PycharmProjects/webScrape/GCS.py", line 39, in <module>
    items = google_search(word, API_KEY, CSE_ID, num=5)
  File "/Users/valizadavali/PycharmProjects/webScrape/GCS.py", line 11, in google_search
    return res['items']
KeyError: 'items'
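A note on this traceback: KeyError: 'items' means the API response contains no 'items' field at all, which typically happens when a query returns no results or the daily quota is exhausted. A minimal defensive sketch (assuming an empty list is acceptable in that case) would be:

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    # 'items' is absent when the query returns no results, so fall back to an empty list
    return res.get('items', [])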
I don't have an API key to run this code, but I can see a few mistakes:

When you use

for items in filteredList:

you get the words from the list, not their indexes, so you can't compare them with numbers.

To get the numbers you could use

for items in range(len(filteredList)):

but it is better to keep the first version and then use items instead of filteredList[items] in

results = google_search(items, my_api_key, my_cse_id, num=5)

If you choose the version with range(len(filteredList)):, then don't add 1 to the index - otherwise you get the numbers 1..6 instead of 0..5, so you skip the first element filteredList[0] and never search the first word, and later you try to get filteredList[6], which doesn't exist in the list, and you get an error.
for word in filteredList:
    results = google_search(word, my_api_key, my_cse_id, num=5)
    print(results)

    newDict = dict()
    for result in results:
        for (key, value) in result.items():
            if key in keyValList:
                newDict[key] = value
    newDictList.append(newDict)

print(newDictList)
BTW: you have to create newDict = dict() in every loop.
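A minimal illustration (made-up values, not from the question) of why the dictionary has to be re-created inside the loop: if one dict object is reused, every entry of the list ends up pointing at the same, last-written data.

rows = []
shared = dict()              # created once, outside the loop - wrong
for n in [1, 2, 3]:
    shared['value'] = n      # keeps overwriting the same object
    rows.append(shared)
print(rows)                  # [{'value': 3}, {'value': 3}, {'value': 3}]

rows = []
for n in [1, 2, 3]:
    row = dict()             # a new object on every iteration - correct
    row['value'] = n
    rows.append(row)
print(rows)                  # [{'value': 1}, {'value': 2}, {'value': 3}]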
BTW: the standard print() and pprint.pprint() only send text to the screen and always return None, so you can't assign the displayed text to a variable. If you need formatted text, use string formatting instead.
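A short sketch of the difference (the sample dict is made up): pprint.pprint() returns None, while pprint.pformat() or ordinary string formatting return a string that can be assigned.

import pprint

data = {'htmlTitle': 'Example', 'link': 'https://example.com'}

text = pprint.pprint(data)    # prints the dict, but text is None
print(text)                   # None

text = pprint.pformat(data)   # pformat() returns the formatted string instead of printing it
print(text)

text = '{} -> {}'.format(data['htmlTitle'], data['link'])   # plain string formatting
print(text)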
EDIT: the version with range(len(...)), which is not the preferred style in Python:
for index in range(len(filteredList)):
    results = google_search(filteredList[index], my_api_key, my_cse_id, num=5)
    print(results)

    newDict = dict()
    for result in results:
        for (key, value) in result.items():
            if key in keyValList:
                newDict[key] = value
    newDictList.append(newDict)

print(newDictList)
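If an index really is needed next to the word, enumerate() is the usual Python idiom; a small sketch of that variant (not part of the original answer):

for index, word in enumerate(filteredList):
    # enumerate() yields (index, element) pairs, so there is no need for range(len(...))
    print(index, word)
    results = google_search(word, my_api_key, my_cse_id, num=5)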
EDIT:
from googleapiclient.discovery import build
import requests

API_KEY = "AIzXXX"
CSE_ID = "013XXX"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res['items']

words = [
    'Semkir sistem',
    'Evrascon',
    'Baku Electronics',
    'Optimal Elektroniks',
    'Avtostar',
    'Improtex',
    # 'Wayback Machine'
]

filtered_results = list()

keys = ['cacheId', 'link', 'htmlTitle', 'htmlSnippet']

for word in words:
    items = google_search(word, API_KEY, CSE_ID, num=5)

    for item in items:
        #print(item.keys()) # to check if every item has the same keys. It seems some items don't have 'cacheId'

        row = dict()  # row of data in the final list of results

        for key in keys:
            row[key] = item.get(key)  # None if there is no `key` in `item`
            #row[key] = item[key]     # ERROR if there is no `key` in `item`

        # generate a link to the cached page
        if row['cacheId']:
            row['link_cache'] = 'https://webcache.googleusercontent.com/search?q=cache:{}:{}'.format(row['cacheId'], row['link'])

            # TODO: read HTML from `link_cache` and get the full text.
            # Maybe the module `newspaper` can be useful for some pages.
            # For other pages `urllib.request` or `requests` may be needed.
            row['html'] = requests.get(row['link_cache']).text
        else:
            row['link_cache'] = None
            row['html'] = ''

        # check whether the word appears in the title or snippet. The word may use upper or lower case chars,
        # so everything is converted to lower case to avoid that problem.
        # This may not be enough if the text uses non-Latin characters, e.g. Cyrillic.
        lower_word = word.lower()
        if (lower_word in row['htmlTitle'].lower()) or (lower_word in row['htmlSnippet'].lower()) or (lower_word in row['html'].lower()):
            filtered_results.append(row)
        else:
            print('SKIP:', word)
            print('    :', row['link'])
            print('    :', row['htmlTitle'])
            print('    :', row['htmlSnippet'])
            print('-----')

for item in filtered_results:
    print('htmlTitle:', item['htmlTitle'])
    print('link:', item['link'])
    print('cacheId:', item['cacheId'])
    print('link_cache:', item['link_cache'])
    print('part of html:', item['html'][:300])
    print('---')
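Not part of the original answer: since the question imports csv, a small sketch of writing filtered_results to a CSV file might look like the following (the file name results.csv and the column selection are assumptions; extrasaction='ignore' skips the long 'html' field).

import csv

fieldnames = ['htmlTitle', 'link', 'cacheId', 'link_cache', 'htmlSnippet']

with open('results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for row in filtered_results:
        writer.writerow(row)   # extra keys such as 'html' are ignored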