What is the proper URL to scrape this website with python and json?

I'm trying to scrape this website --> https://ucr.gov/enforcement/1000511. It used to work with the code below, then it stopped. I can't get JSON or anything else in the response.

import requests

query = "1000511"

url = 'https://ucr.gov/api/enforcement/{}'.format(query)


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://ucr.gov/enforcement/{}'.format(query),
    'Cache-Control': 'no-cache,no-store,must-revalidate,max-age=0,private',
    'content-type': 'application/json;charset=UTF-8',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'UCR-UI-Version': '20.5.4',
    'Origin': 'https://ucr.gov',
    'Connection': 'keep-alive',
}

s = requests.Session()

params = (
    ('pageNumber', '0'),
    ('itemsPerPage', '15'),
)

response = s.get(url, headers=headers, params=params)

response.json()

The expected content can be found here: https://ucr.gov/enforcement/1000511

Instead, I get this error:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Again, this used to work a few weeks ago. Please help me find the error.

Correction 1: I originally posted the url as:

url = 'https://admin.ucr.gov/api/enforcement/{}'.format(query)

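This is how it used to work. Now I see that the website uses the same url but without "admin" (the code above has been changed). But I still don't get any of the results/content you would expect from visiting https://ucr.gov/enforcement/1000511.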

Using (for example) Chrome's DevTools, you can see the call that the page makes. You can then copy it as cURL and try it on the command line with the needed headers:

$ curl 'https://admin.ucr.gov/api/enforcement' \
>   -H 'authority: admin.ucr.gov' \
>   -H 'accept: application/json, text/plain, */*' \
>   -H 'cache-control: no-cache,no-store,must-revalidate,max-age=0,private' \
>   -H 'ucr-ui-version: 20.5.4' \
>   -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36' \
>   -H 'dnt: 1' \
>   -H 'content-type: application/json;charset=UTF-8' \
>   -H 'origin: https://ucr.gov' \
>   -H 'sec-fetch-site: same-site' \
>   -H 'sec-fetch-mode: cors' \
>   -H 'sec-fetch-dest: empty' \
>   -H 'referer: https://ucr.gov/enforcement/1000511' \
>   -H 'accept-language: it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7' \
>   --data-binary '{"searchTerm":"1000511","itemsPerPage":15,"pageNumber":0}' \
>   --compressed
{"carrier":{"usdot":1000511,"legalName":"877599 ALBERTA LTD","dateAdded":"2002-01-24T00:00:00Z","physicalAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"},"mailingAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"}},"history":{"enforcementRegistrations":[{"year":2020,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5166421+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2019,"status":"unregistered","updateTime":"2020-06-15T16:12:53.53272+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2018,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5486724+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2017,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5646021+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:12"}]}}

Now you can try removing the headers one by one, and you will find that this request still succeeds:

$ curl 'https://admin.ucr.gov/api/enforcement' --data-binary '{"searchTerm":"1000511"}'    -H 'ucr-ui-version: 20.5.4'    -H 'content-type: application/json;charset=UTF-8'
{"carrier":{"usdot":1000511,"legalName":"877599 ALBERTA LTD","dateAdded":"2002-01-24T00:00:00Z","physicalAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"},"mailingAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"}},"history":{"enforcementRegistrations":[{"year":2020,"status":"unregistered","updateTime":"2020-06-15T16:20:41.3271743+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2019,"status":"unregistered","updateTime":"2020-06-15T16:20:41.3951487+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2018,"status":"unregistered","updateTime":"2020-06-15T16:20:41.468421+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2017,"status":"unregistered","updateTime":"2020-06-15T16:20:41.5511652+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:20"}]}}
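The one-by-one elimination can also be automated. Below is a minimal sketch, not something taken from the site or from the call above: `minimal_headers` and `fake_server_accepts` are hypothetical names, and the fake check stands in for a real request so that the example runs offline. In practice `works` would send the POST with the trial headers and report whether the response was a 200 with a JSON body.

```python
from typing import Callable, Dict


def minimal_headers(headers: Dict[str, str],
                    works: Callable[[Dict[str, str]], bool]) -> Dict[str, str]:
    """Drop headers one at a time, keeping only those the request needs."""
    needed = dict(headers)
    for name in list(headers):
        # Try the request without this one header.
        trial = {k: v for k, v in needed.items() if k != name}
        if works(trial):
            needed = trial  # still succeeds, so the header was not required
    return needed


def fake_server_accepts(h: Dict[str, str]) -> bool:
    # Hypothetical stand-in for a real request: assumes the server only
    # insists on these two headers, as the shortened curl call suggests.
    return "ucr-ui-version" in h and "content-type" in h


browser_headers = {
    "authority": "admin.ucr.gov",
    "accept": "application/json, text/plain, */*",
    "ucr-ui-version": "20.5.4",
    "dnt": "1",
    "content-type": "application/json;charset=UTF-8",
    "origin": "https://ucr.gov",
}

required = minimal_headers(browser_headers, fake_server_accepts)
print(sorted(required))
```

This greedy pass sends one request per header, so be gentle with a real server; it also assumes headers are independent of each other, which may not hold everywhere.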

Now convert the curl call to python. Note that the call is a POST, not a GET as in your code!:

In [1]: import requests

In [2]: import io
   ...: response = requests.post('https://admin.ucr.gov/api/enforcement', data=io.StringIO('{"searchTerm":"1000511"}'), headers={'ucr-ui-version': '20.5.4', 'content-type': 'application/json;charset=UTF-8'})

In [3]: response.status_code
Out[3]: 200

In [4]: response.json()
Out[4]: 
{'carrier': {'usdot': 1000511,
  'legalName': '877599 ALBERTA LTD',
  'dateAdded': '2002-01-24T00:00:00Z',
  'physicalAddress': {'street': '430 66 STREET SW',
   'city': 'EDMONTON',
   'state': 'AB',
   'region': 'CAAB',
   'zipCode': 'T6X 1A3',
   'country': 'C',
   'countryCode': 'CA'},
  'mailingAddress': {'street': '430 66 STREET SW',
   'city': 'EDMONTON',
   'state': 'AB',
   'region': 'CAAB',
   'zipCode': 'T6X 1A3',
   'country': 'C',
   'countryCode': 'CA'}},
 'history': {'enforcementRegistrations': [{'year': 2020,
    'status': 'unregistered',
    'updateTime': '2020-06-15T16:23:10.2114276+00:00',
    'isApplicable': True,
    'isYearActive': True,
    'updateTimeDisplay': '06/15/2020 16:23'},
   {'year': 2019,
    'status': 'unregistered',
    'updateTime': '2020-06-15T16:23:10.278671+00:00',
    'isApplicable': True,
    'isYearActive': True,
    'updateTimeDisplay': '06/15/2020 16:23'},
   {'year': 2018,
    'status': 'unregistered',
    'updateTime': '2020-06-15T16:23:10.3507073+00:00',
    'isApplicable': True,
    'isYearActive': False,
    'updateTimeDisplay': '06/15/2020 16:23'},
   {'year': 2017,
    'status': 'unregistered',
    'updateTime': '2020-06-15T16:23:10.4026579+00:00',
    'isApplicable': True,
    'isYearActive': False,
    'updateTimeDisplay': '06/15/2020 16:23'}]}}
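For a script rather than an interactive session, the same call could be wrapped roughly as below. This is a sketch under assumptions: `build_body` and `fetch_enforcement` are hypothetical helper names, the `ucr-ui-version` value is simply copied from the browser call and may need updating when the site changes, and the paging fields that the question passed as query parameters are moved into the JSON body because that is where the browser sends them.

```python
import json

API_URL = "https://admin.ucr.gov/api/enforcement"

# The two headers that the shortened curl call showed to be sufficient.
# The version string is copied from that call and is an assumption going forward.
HEADERS = {
    "ucr-ui-version": "20.5.4",
    "content-type": "application/json;charset=UTF-8",
}


def build_body(query, items_per_page=15, page_number=0):
    """Serialize the search into the JSON body the browser sends."""
    return json.dumps({
        "searchTerm": str(query),
        "itemsPerPage": items_per_page,
        "pageNumber": page_number,
    })


def fetch_enforcement(query, timeout=10):
    """POST the search and return the decoded JSON (hypothetical helper)."""
    import requests  # third-party; imported here so build_body works without it

    response = requests.post(API_URL, data=build_body(query),
                             headers=HEADERS, timeout=timeout)
    # Raise on 4xx/5xx here instead of hitting a JSONDecodeError later.
    response.raise_for_status()
    return response.json()
```

Calling `fetch_enforcement("1000511")` should then give the same kind of dictionary as shown above, provided the endpoint still behaves as it did in this session. Calling `raise_for_status()` before `.json()` makes an HTTP error status show up as such rather than as "Expecting value: line 1 column 1 (char 0)".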