What is the proper URL to scrape this website with Python and JSON?
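I am trying to scrape this website --> https://ucr.gov/enforcement/1000511
It used to work with the code below, and then it stopped. I can't get any JSON or any content at all in the response.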
import requests

query = "1000511"
url = 'https://ucr.gov/api/enforcement/{}'.format(query)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://ucr.gov/enforcement/{}'.format(query),
    'Cache-Control': 'no-cache,no-store,must-revalidate,max-age=0,private',
    'content-type': 'application/json;charset=UTF-8',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'UCR-UI-Version': '20.5.4',
    'Origin': 'https://ucr.gov',
    'Connection': 'keep-alive',
}
s = requests.Session()
params = (
    ('pageNumber', '0'),
    ('itemsPerPage', '15'),
)
response = s.get(url, headers=headers, params=params)
response.json()
The expected content can be seen here: https://ucr.gov/enforcement/1000511
Instead, I get this error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Again, this worked a few weeks ago. Please help me find the error.
Correction 1:
- I originally posted the URL as:
url = 'https://admin.ucr.gov/api/enforcement/{}'.format(query)
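That is how it used to work. Now I see that the website uses the same URL but without "admin" (the code above has been changed accordingly). Even so, I still don't get any of the results/content you would expect to see when visiting https://ucr.gov/enforcement/1000511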
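A JSONDecodeError at line 1 column 1 (char 0) generally means the body was empty or was not JSON at all (an HTML page, for instance). Below is a minimal sketch for checking the body before parsing it. The helper name `parse_json_body` is made up for illustration and is not part of requests.

```python
import json


def parse_json_body(status_code, content_type, text):
    """Return (data, problem); problem is None when parsing succeeded."""
    if not text.strip():
        return None, "empty body (HTTP {})".format(status_code)
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        # Include the start of the body so an HTML error page is easy to spot.
        problem = "not JSON (HTTP {}, Content-Type {!r}): {}; body starts with {!r}".format(
            status_code, content_type, exc.msg, text[:60])
        return None, problem
```

With a requests response this could be called as parse_json_body(response.status_code, response.headers.get('Content-Type'), response.text) instead of calling response.json() directly.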
Using (for example) Chrome's DevTools, you can see the request that the page makes.
You can then copy it as cURL and try it on the command line with all of its headers:
$ curl 'https://admin.ucr.gov/api/enforcement' \
> -H 'authority: admin.ucr.gov' \
> -H 'accept: application/json, text/plain, */*' \
> -H 'cache-control: no-cache,no-store,must-revalidate,max-age=0,private' \
> -H 'ucr-ui-version: 20.5.4' \
> -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36' \
> -H 'dnt: 1' \
> -H 'content-type: application/json;charset=UTF-8' \
> -H 'origin: https://ucr.gov' \
> -H 'sec-fetch-site: same-site' \
> -H 'sec-fetch-mode: cors' \
> -H 'sec-fetch-dest: empty' \
> -H 'referer: https://ucr.gov/enforcement/1000511' \
> -H 'accept-language: it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7' \
> --data-binary '{"searchTerm":"1000511","itemsPerPage":15,"pageNumber":0}' \
> --compressed
{"carrier":{"usdot":1000511,"legalName":"877599 ALBERTA LTD","dateAdded":"2002-01-24T00:00:00Z","physicalAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"},"mailingAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"}},"history":{"enforcementRegistrations":[{"year":2020,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5166421+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2019,"status":"unregistered","updateTime":"2020-06-15T16:12:53.53272+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2018,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5486724+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2017,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5646021+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:12"}]}}
Now you can try removing the headers one by one, and you will find that this request still succeeds:
$ curl 'https://admin.ucr.gov/api/enforcement' --data-binary '{"searchTerm":"1000511"}' -H 'ucr-ui-version: 20.5.4' -H 'content-type: application/json;charset=UTF-8'
{"carrier":{"usdot":1000511,"legalName":"877599 ALBERTA LTD","dateAdded":"2002-01-24T00:00:00Z","physicalAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"},"mailingAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"}},"history":{"enforcementRegistrations":[{"year":2020,"status":"unregistered","updateTime":"2020-06-15T16:20:41.3271743+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2019,"status":"unregistered","updateTime":"2020-06-15T16:20:41.3951487+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2018,"status":"unregistered","updateTime":"2020-06-15T16:20:41.468421+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2017,"status":"unregistered","updateTime":"2020-06-15T16:20:41.5511652+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:20"}]}}
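The "remove the headers one by one" step can also be automated. The sketch below is only an illustration: `minimal_headers` and the `request_ok` callback are hypothetical names, and the callback is assumed to send the request with the given headers and return True when it succeeds.

```python
def minimal_headers(headers, request_ok):
    """Greedily drop every header the request still succeeds without.

    `request_ok` is a caller-supplied function (an assumption of this
    sketch) that sends the request with the given headers and returns
    True on success.
    """
    needed = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in needed.items() if k != name}
        if request_ok(trial):
            needed = trial  # the request still works without this header
    return needed
```

With requests, the callback could be something like lambda h: requests.post(url, data=body, headers=h).ok, at the cost of one request per header tried.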
Now convert the curl call to Python. Note that the call is a POST, not a GET as in your code:
In [1]: import requests
In [2]: import io
...: response = requests.post('https://admin.ucr.gov/api/enforcement', data=io.StringIO('{"searchTerm":"1000511"}'), headers={'ucr-ui-version': '20.5.4', 'content-type': 'application/json;charset=UTF-8'})
In [3]: response.status_code
Out[3]: 200
In [4]: response.json()
Out[4]:
{'carrier': {'usdot': 1000511,
'legalName': '877599 ALBERTA LTD',
'dateAdded': '2002-01-24T00:00:00Z',
'physicalAddress': {'street': '430 66 STREET SW',
'city': 'EDMONTON',
'state': 'AB',
'region': 'CAAB',
'zipCode': 'T6X 1A3',
'country': 'C',
'countryCode': 'CA'},
'mailingAddress': {'street': '430 66 STREET SW',
'city': 'EDMONTON',
'state': 'AB',
'region': 'CAAB',
'zipCode': 'T6X 1A3',
'country': 'C',
'countryCode': 'CA'}},
'history': {'enforcementRegistrations': [{'year': 2020,
'status': 'unregistered',
'updateTime': '2020-06-15T16:23:10.2114276+00:00',
'isApplicable': True,
'isYearActive': True,
'updateTimeDisplay': '06/15/2020 16:23'},
{'year': 2019,
'status': 'unregistered',
'updateTime': '2020-06-15T16:23:10.278671+00:00',
'isApplicable': True,
'isYearActive': True,
'updateTimeDisplay': '06/15/2020 16:23'},
{'year': 2018,
'status': 'unregistered',
'updateTime': '2020-06-15T16:23:10.3507073+00:00',
'isApplicable': True,
'isYearActive': False,
'updateTimeDisplay': '06/15/2020 16:23'},
{'year': 2017,
'status': 'unregistered',
'updateTime': '2020-06-15T16:23:10.4026579+00:00',
'isApplicable': True,
'isYearActive': False,
'updateTimeDisplay': '06/15/2020 16:23'}]}}
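Once the JSON is available, the per-year registration status can be pulled out of it. Here is a minimal sketch, assuming the response keeps the shape shown above; the function name `registration_status_by_year` is made up for illustration, and `sample` is a trimmed copy of that output.

```python
def registration_status_by_year(payload):
    """Map each year to its status from history.enforcementRegistrations."""
    registrations = payload.get("history", {}).get("enforcementRegistrations", [])
    return {entry["year"]: entry["status"] for entry in registrations}


# Trimmed copy of the response shown above (only the fields used here).
sample = {
    "carrier": {"usdot": 1000511, "legalName": "877599 ALBERTA LTD"},
    "history": {
        "enforcementRegistrations": [
            {"year": 2020, "status": "unregistered", "isYearActive": True},
            {"year": 2019, "status": "unregistered", "isYearActive": True},
            {"year": 2018, "status": "unregistered", "isYearActive": False},
            {"year": 2017, "status": "unregistered", "isYearActive": False},
        ]
    },
}

print(registration_status_by_year(sample))
# {2020: 'unregistered', 2019: 'unregistered', 2018: 'unregistered', 2017: 'unregistered'}
```

In the session above, the same function could be applied to response.json().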