Parsing through a really long text

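I'm a complete beginner in Python, but I'm building a web scraper as a project. I'm using Jupyter Notebook, BeautifulSoup, and lxml.

I managed to get the text that contains all the information I need, but now I don't know what to do with it. I want to pull out specific data such as longitude, latitude, siteId, and direction (north, south, etc.), and I want to download the photos and rename them. I need to do this for all 41 locations. If anyone can suggest a package or a method, I'd be really grateful. Thanks!

Here is a small part of the text I scraped (the pattern repeats 41 times):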

{
  "count": 41,
  "message": "success",
  "results": [
    {
      "protocol": "land_covers",
      "measuredDate": "2020-06-13",
      "createDate": "2020-06-13T16:35:04",
      "updateDate": "2020-06-15T14:00:10",
      "publishDate": "2020-07-17T21:06:31",
      "organizationId": 17043304,
      "organizationName": "United States of America Citizen Science",
      "siteId": 202689,
      "siteName": "18TWK294769",
      "countryName": null,
      "countryCode": null,
      "latitude": xx.xxx(edited),
      "longitude": xx.xxx(edited),
      "elevation": 25.4,
      "pid": 163672280,
      "data": {
        "landcoversDownwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682247/original.jpg",
        "landcoversEastExtraData": "(source: app, (compassData.horizon: -14.32171587255965))",
        "landcoversEastPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682242/original.jpg",
        "landcoversMucCode": null,
        "landcoversUpwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682246/original.jpg",
        "landcoversEastCaption": "",
        "landcoversMeasurementLatitude": xx.xxx(edited),
        "landcoversWestClassifications": null,
        "landcoversNorthCaption": "",
        "landcoversNorthExtraData": "(source: app, (compassData.horizon: -10.817734330181267))",
        "landcoversDataSource": "GLOBE Observer App",
        "landcoversDryGround": true,
        "landcoversSouthClassifications": null,
        "landcoversWestCaption": "",
        "landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682241/original.jpg",
        "landcoversUpwardCaption": "",
        "landcoversDownwardExtraData": "(source: app, (compassData.horizon: -84.48900393488086))",
        "landcoversEastClassifications": null,
        "landcoversMucDetails": "",
        "landcoversMeasuredAt": "2020-06-13T15:12:00",
        "landcoversDownwardCaption": "",
        "landcoversSouthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682243/original.jpg",
        "landcoversMuddy": false,
        "landcoversWestPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682245/original.jpg",
        "landcoversStandingWater": false,
        "landcoversLeavesOnTrees": true,
        "landcoversUserid": 67150810,
        "landcoversSouthExtraData": "(source: app, (compassData.horizon: -14.872806403121302))",
        "landcoversSouthCaption": "",
        "landcoversRainingSnowing": false,
        "landcoversUpwardExtraData": "(source: app, (compassData.horizon: 89.09211989270894))",
        "landcoversMeasurementElevation": 24.1,
        "landcoversWestExtraData": "(source: app, (compassData.horizon: -15.47334477111039))",
        "landcoversLandCoverId": 32043,
        "landcoversMeasurementLongitude": xx.xxx(edited),
        "landcoversMucDescription": null,
        "landcoversSnowIce": false,
        "landcoversNorthClassifications": null,
        "landcoversFieldNotes": "(none)"
      }
    },
    {
      "protocol": "land_covers",
      "measuredDate": "2020-06-13",
      "createDate": "2020-06-13T16:35:04",
      "updateDate": "2020-06-15T14:00:10",
      "publishDate": "2020-07-17T21:06:31",
      "organizationId": 17043304,
      "organizationName": "United States of America Citizen Science",
      "siteId": 202689,
      "siteName": "18TWK294769",
      "countryName": null,
      "countryCode": null,
      "latitude": xx.xxx(edited),
      "longitude": xx.xxx(edited),
      "elevation": 25.4,
      "pid": 163672280,
      "data": {
        "landcoversDownwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682240/original.jpg",
        "landcoversEastExtraData": "(source: app, (compassData.horizon: -6.06710116543897))",
        "landcoversEastPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682235/original.jpg",
        "landcoversMucCode": null,
        "landcoversUpwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682239/original.jpg",
        "landcoversEastCaption": "",
        "landcoversMeasurementLatitude": xx.xxx(edited),
        "landcoversWestClassifications": null,
        "landcoversNorthCaption": "",
        "landcoversNorthExtraData": "(source: app, (compassData.horizon: -9.199031748908894))",
        "landcoversDataSource": "GLOBE Observer App",
        "landcoversDryGround": true,
        "landcoversSouthClassifications": null,
        "landcoversWestCaption": "",
        "landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682233/original.jpg",
        "landcoversUpwardCaption": "",
        "landcoversDownwardExtraData": "(source: app, (compassData.horizon: -88.86569321651771))",
        "landcoversEastClassifications": null,
        "landcoversMucDetails": "",
        "landcoversMeasuredAt": "2020-06-13T15:07:00",
        "landcoversDownwardCaption": "",
        "landcoversSouthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682236/original.jpg",
        "landcoversMuddy": false,
        "landcoversWestPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682237/original.jpg",
        "landcoversStandingWater": false,
        "landcoversLeavesOnTrees": true,
        "landcoversUserid": 67150810,
        "landcoversSouthExtraData": "(source: app, (compassData.horizon: -11.615041431350335))",
        "landcoversSouthCaption": "",
        "landcoversRainingSnowing": false,
        "landcoversUpwardExtraData": "(source: app, (compassData.horizon: 86.6284079864236))",
        "landcoversMeasurementElevation": 24,
        "landcoversWestExtraData": "(source: app, (compassData.horizon: -9.251774266832626))",
        "landcoversLandCoverId": 32042,
        "landcoversMeasurementLongitude": xx.xxx(edited),
        "landcoversMucDescription": null,
        "landcoversSnowIce": false,
        "landcoversNorthClassifications": null,
        "landcoversFieldNotes": "(none)"
      }
    },

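It would help to see some code. That said, as has already been pointed out, the built-in json library will help you here. This output is in JSON format; for an introduction to the format, see here.

Let's assume the output you show here is stored in a variable called data. You can convert this JSON data into a dictionary.

Code example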

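import json

# Assumes data is a string of JSON text; use json.load(f) instead for an open file object
data_dict = json.loads(data)

What json.loads does is take a string of JSON text and convert it into a Python dictionary (json.load, without the "s", does the same for an open file object). It parses the text and uses a conversion table to map each JSON type to a Python type, so a JSON object becomes a dict. Other JSON values convert to other Python object types. See the table here.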
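To make the conversion concrete, here is a minimal sketch that parses a tiny made-up JSON string (not the real response; the "dryGround" key is only illustrative) and prints the Python type each value ends up as.

```python
import json

# A tiny made-up JSON string (not the real response) covering the common JSON types.
sample = (
    '{"count": 41, "elevation": 25.4, "message": "success", '
    '"countryName": null, "dryGround": true, '
    '"results": [{"siteId": 202689}]}'
)

parsed = json.loads(sample)

# Show which Python type each JSON value was converted to.
for key, value in parsed.items():
    print(key, type(value).__name__)
```

JSON objects come back as dict, arrays as list, strings as str, numbers as int or float, true/false as bool, and null as None.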

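Now you have a Python dictionary that you can access the data from. So let's go through longitude, latitude, siteId, and direction (north, south, etc.). I see there is an opening "[" but no matching "]" in what you posted. From your description I can only assume that the list holds 41 items, so I'll start by getting the first result. You can always loop over the list easily to get all 41 results.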

# Keys are case-sensitive and must match the JSON exactly ("latitude", "siteId")
longitude = data_dict['results'][0]['longitude']
latitude = data_dict['results'][0]['latitude']
site_id = data_dict['results'][0]['siteId']
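Looping over every result could look roughly like the sketch below. It assumes each photo key follows the "landcovers<Direction>PhotoUrl" pattern seen in the sample. The function names and the file naming scheme are hypothetical, and the sample dictionary is a trimmed stand-in with placeholder coordinates, since the real ones were edited out of the question. Both results shown in the question share the same siteId, so the file name also includes landcoversLandCoverId to keep names unique.

```python
import urllib.request
from pathlib import Path

DIRECTIONS = ("North", "South", "East", "West", "Upward", "Downward")


def extract_records(data_dict):
    """Flatten each entry of data_dict['results'] into the fields of interest."""
    records = []
    for result in data_dict["results"]:
        details = result["data"]
        photos = {}
        for direction in DIRECTIONS:
            url = details.get(f"landcovers{direction}PhotoUrl")
            if url:  # skip missing or null photo URLs
                photos[direction.lower()] = url
        records.append({
            "site_id": result["siteId"],
            "land_cover_id": details.get("landcoversLandCoverId"),
            "latitude": result["latitude"],
            "longitude": result["longitude"],
            "photos": photos,
        })
    return records


def photo_filename(record, direction):
    """Hypothetical naming scheme: <siteId>_<landCoverId>_<direction>.jpg"""
    return f"{record['site_id']}_{record['land_cover_id']}_{direction}.jpg"


def download_photos(records, folder="photos"):
    """Download every photo; needs network access, so it is not called below."""
    target = Path(folder)
    target.mkdir(exist_ok=True)
    for record in records:
        for direction, url in record["photos"].items():
            destination = target / photo_filename(record, direction)
            urllib.request.urlretrieve(url, str(destination))


# Trimmed stand-in for the parsed response: one result, two photos, one null URL.
sample = {
    "results": [
        {
            "siteId": 202689,
            "latitude": 0.0,   # placeholder, real value was edited out
            "longitude": 0.0,  # placeholder, real value was edited out
            "data": {
                "landcoversLandCoverId": 32043,
                "landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682241/original.jpg",
                "landcoversEastPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682242/original.jpg",
                "landcoversWestPhotoUrl": None,
            },
        }
    ]
}

records = extract_records(sample)
for record in records:
    for direction in record["photos"]:
        print(photo_filename(record, direction))
```

With the real data you would pass data_dict to extract_records and then call download_photos(records) to save the renamed files.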

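Tips

  1. I always use Jupyter notebooks as a quick way to experiment with pulling the specific data I want out of a JSON object; it can sometimes take a while to reach the right part. That way, by the time I write the final variables, I know I'm getting the data I want. JSON objects are sometimes deeply nested and hard to follow.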
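One way to do that kind of exploration in a notebook cell is to print the keys and value types level by level. The helper below is a hypothetical sketch (describe is not part of any library), run here on a small stand-in dictionary rather than the real response.

```python
def describe(value, depth=0, max_depth=2):
    """Return lines summarising the keys and value types of parsed JSON."""
    lines = []
    indent = "  " * depth
    if isinstance(value, dict):
        for key, item in value.items():
            lines.append(f"{indent}{key}: {type(item).__name__}")
            if depth < max_depth:
                lines.extend(describe(item, depth + 1, max_depth))
    elif isinstance(value, list) and value:
        # Only the first element is inspected, assuming the items share one shape.
        lines.append(f"{indent}[0] of {len(value)}: {type(value[0]).__name__}")
        if depth < max_depth:
            lines.extend(describe(value[0], depth + 1, max_depth))
    return lines


# Small stand-in for the parsed response.
sample = {
    "count": 1,
    "results": [{"siteId": 202689, "data": {"landcoversMuddy": False}}],
}

lines = describe(sample)
print("\n".join(lines))
```

Raising max_depth shows deeper levels, which helps when you are not yet sure where a field lives.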