Parsing a Google Search JSON response with Python requests: multiline regex
I want to turn the raw response into valid JSON. I can do it, but my approach is a bit sloppy.
This is the original response:
// API callback
google.search.Search.apiary2387({
"cursor": {
"currentPageIndex": 0,
"estimatedResultCount": "4490",
"moreResultsUrl": "http://www.google.com/cse?oe=utf8&ie=utf8&source=uds&q=ssh&start=0&sort=&cx=013305635491195529773:0ufpuq-fpt0",
"resultCount": "4,490",
"searchResultTime": "0.22",
"pages": [
{
"label": 1,
"start": "0"
},
{
"label": 2,
"start": "1"
},
{
"label": 3,
"start": "2"
},
{
"label": 4,
"start": "3"
},
{
"label": 5,
"start": "4"
},
{
"label": 6,
"start": "5"
},
{
"label": 7,
"start": "6"
},
{
"label": 8,
"start": "7"
},
{
"label": 9,
"start": "8"
},
{
"label": 10,
"start": "9"
}
]
},
"context": {
"title": "Pastebin Active",
"total_results": "0",
"facets": []
},
"results": [
{
"GsearchResultClass": "GwebSearch",
"cacheUrl": "http://www.google.com/search?q=cache:PBL2A25kpZoJ:pastebin.com",
"clicktrackUrl": "https://www.google.com/url?q=http://pastebin.com/u/ssh&sa=U&ved=0ahUKEwiO4fjNpovMAhWBPxoKHYJXAS4QFggEMAA&client=internal-uds-cse&usg=AFQjCNHczEhDXdcUnRZhpArEeSiHfjwMJA",
"content": "BitBucket - Backup your code in the cloud! Host unlimited private projects, for free\n. SIGN UP takes 10 seconds, and it's free! Guest ...",
"contentNoFormatting": "BitBucket - Backup your code in the cloud! Host unlimited private projects, for free\n. SIGN UP takes 10 seconds, and it's free! Guest ...",
"formattedUrl": "pastebin.com/u/\u003cb\u003essh\u003c/b\u003e",
"title": "\u003cb\u003eSsh's\u003c/b\u003e Pastebin - Pastebin.com",
"titleNoFormatting": "Ssh's Pastebin - Pastebin.com",
"unescapedUrl": "http://pastebin.com/u/ssh",
"url": "http://pastebin.com/u/ssh",
"visibleUrl": "pastebin.com",
"richSnippet": {
"cseImage": {
"src": "http://pastebin.com/i/facebook.png"
},
"metatags": {
"fbAppId": "231493360234820",
"ogTitle": "Ssh's Pastebin - Pastebin.com",
"ogType": "article",
"ogUrl": "http://pastebin.com/u/ssh",
"ogImage": "http://pastebin.com/i/facebook.png",
"ogSiteName": "Pastebin",
"viewport": "width=device-width, maximum-scale=1.0, user-scalable=no"
}
}
}
]
}
);
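To extract valid JSON I have to strip the JavaScript call, so I remove everything up to and including the first ( and then the closing ) at the end.
This is how I thought it would work: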
import requests
import re
import json
url = 'https://www.googleapis.com/customsearch/v1element?key=AIzaSyCVAXiUzRYsML1Pv6RwSG1gunmMikTzQqY&rsz=filtered_cse&num=1&hl=en&prettyPrint=true&source=gcsc&gss=.com&sig=432dd570d1a386253361f581254f9ca1&start=0&cx=013305635491195529773:0ufpuq-fpt0&q=ssh&sort=&googlehost=www.google.com&callback=google.search.Search.apiary2387'
resp = requests.get(url)
content = resp.content
formatted = re.sub(r'(.*\(|\);$)','', content , re.I|re.M|re.DOTALL)
formatted_json = json.loads(formatted)
for i, result in enumerate(formatted_json['results']):
    print formatted_json['results'][i]['url']
This is what I had to add to make it work:
formatted = re.sub(r'// API callback', '', content)
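I don't understand why this is needed, since I am already removing everything up to the first (. If I am using the re.M flag, why doesn't the pattern apply across all lines?
You can see that (r'(.*\(|\);$)','', content , re.DOTALL) should work:
https://regex101.com/r/uN2wV4/3
(the /s option there makes . match \n as well, like DOTALL)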
I created the following regex:
(\{(?:.|\n)*\})
Instead of replacing, it captures everything from the opening brace to the closing brace.
So you can use it with re.search to get what you need:
formatted = re.search(r'(\{(?:.|\n)*\})', content).group()
Update: use re.DOTALL
re.DOTALL is equivalent to the /s modifier (updated regex):
formatted = re.search(r'(\{.*\})', content, re.DOTALL).group()
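A likely reason the original re.sub call needed the extra step: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a fourth positional argument is read as count and the flags never take effect. Without DOTALL, .*\( only matches within a single line, and the "// API callback" line contains no (. Below is a minimal sketch for Python 3, using a shortened, made-up sample in place of the real response (the sample text is an assumption), that shows both the re.search approach and the same re.sub pattern with flags passed by keyword:

```python
import json
import re

# Shortened stand-in for the JSONP response (hypothetical sample data).
sample = '''// API callback
google.search.Search.apiary2387({
  "results": [
    {"url": "http://pastebin.com/u/ssh"}
  ]
}
);'''

# re.search takes flags as its third positional argument, so DOTALL applies.
via_search = re.search(r'\{.*\}', sample, re.DOTALL).group()

# re.sub(pattern, repl, string, count=0, flags=0): pass flags by keyword,
# otherwise the value is taken as count and the flags are not applied.
via_sub = re.sub(r'(.*\(|\);$)', '', sample, flags=re.M | re.DOTALL)

data_search = json.loads(via_search)
data_sub = json.loads(via_sub)

urls = [result['url'] for result in data_search['results']]
print(urls)
```

Both variants should parse to the same object here. The greedy .* in either pattern relies on the payload ending at the last } and on the wrapper holding the only ( in the text, which is true for this sample but not guaranteed in general.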
The simplest way: just remove this parameter from the request:
callback=google.search.Search.apiary2387
and the response is valid JSON.
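If the URL is built elsewhere and you cannot simply edit the string, the parameter could be dropped programmatically. This is only a sketch for Python 3: drop_query_param is a hypothetical helper name, and the URL is a shortened version of the one in the question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as "sort=".
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Shortened version of the question's URL (most parameters omitted).
jsonp_url = ('https://www.googleapis.com/customsearch/v1element'
             '?q=ssh&sort=&callback=google.search.Search.apiary2387')
plain_url = drop_query_param(jsonp_url, 'callback')
print(plain_url)

# With requests, the body could then be decoded directly, for example:
#   data = requests.get(plain_url).json()
```

Note that urlencode re-encodes the remaining values, so a character such as : in a parameter comes back percent-encoded, which is equivalent for the server.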