Parsing through a really long text
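I'm a complete beginner with Python, but I'm building a web scraper as a project.
I'm using Jupyter Notebook, BeautifulSoup and lxml.
I managed to get the text that contains all the information I need, but now I'm stuck on what to do next.
I want to extract specific data such as longitude, latitude, siteId and direction (north, south, etc.), and I want to download the photos and rename them. I need to do this for all 41 locations.
If anyone can suggest any packages or methods, I'd really appreciate it! Thanks!
Here is a small part of the text I scraped (the pattern repeats 41 times):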
{
"count": 41,
"message": "success",
"results": [
{
"protocol": "land_covers",
"measuredDate": "2020-06-13",
"createDate": "2020-06-13T16:35:04",
"updateDate": "2020-06-15T14:00:10",
"publishDate": "2020-07-17T21:06:31",
"organizationId": 17043304,
"organizationName": "United States of America Citizen Science",
"siteId": 202689,
"siteName": "18TWK294769",
"countryName": null,
"countryCode": null,
"latitude": xx.xxx(edited),
"longitude": xx.xxx(edited),
"elevation": 25.4,
"pid": 163672280,
"data": {
"landcoversDownwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682247/original.jpg",
"landcoversEastExtraData": "(source: app, (compassData.horizon: -14.32171587255965))",
"landcoversEastPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682242/original.jpg",
"landcoversMucCode": null,
"landcoversUpwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682246/original.jpg",
"landcoversEastCaption": "",
"landcoversMeasurementLatitude": xx.xxx(edited),
"landcoversWestClassifications": null,
"landcoversNorthCaption": "",
"landcoversNorthExtraData": "(source: app, (compassData.horizon: -10.817734330181267))",
"landcoversDataSource": "GLOBE Observer App",
"landcoversDryGround": true,
"landcoversSouthClassifications": null,
"landcoversWestCaption": "",
"landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682241/original.jpg",
"landcoversUpwardCaption": "",
"landcoversDownwardExtraData": "(source: app, (compassData.horizon: -84.48900393488086))",
"landcoversEastClassifications": null,
"landcoversMucDetails": "",
"landcoversMeasuredAt": "2020-06-13T15:12:00",
"landcoversDownwardCaption": "",
"landcoversSouthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682243/original.jpg",
"landcoversMuddy": false,
"landcoversWestPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682245/original.jpg",
"landcoversStandingWater": false,
"landcoversLeavesOnTrees": true,
"landcoversUserid": 67150810,
"landcoversSouthExtraData": "(source: app, (compassData.horizon: -14.872806403121302))",
"landcoversSouthCaption": "",
"landcoversRainingSnowing": false,
"landcoversUpwardExtraData": "(source: app, (compassData.horizon: 89.09211989270894))",
"landcoversMeasurementElevation": 24.1,
"landcoversWestExtraData": "(source: app, (compassData.horizon: -15.47334477111039))",
"landcoversLandCoverId": 32043,
"landcoversMeasurementLongitude": xx.xxx(edited),
"landcoversMucDescription": null,
"landcoversSnowIce": false,
"landcoversNorthClassifications": null,
"landcoversFieldNotes": "(none)"
}
},
{
"protocol": "land_covers",
"measuredDate": "2020-06-13",
"createDate": "2020-06-13T16:35:04",
"updateDate": "2020-06-15T14:00:10",
"publishDate": "2020-07-17T21:06:31",
"organizationId": 17043304,
"organizationName": "United States of America Citizen Science",
"siteId": 202689,
"siteName": "18TWK294769",
"countryName": null,
"countryCode": null,
"latitude": xx.xxx(edited),
"longitude": xx.xxx(edited),
"elevation": 25.4,
"pid": 163672280,
"data": {
"landcoversDownwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682240/original.jpg",
"landcoversEastExtraData": "(source: app, (compassData.horizon: -6.06710116543897))",
"landcoversEastPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682235/original.jpg",
"landcoversMucCode": null,
"landcoversUpwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682239/original.jpg",
"landcoversEastCaption": "",
"landcoversMeasurementLatitude": xx.xxx(edited),
"landcoversWestClassifications": null,
"landcoversNorthCaption": "",
"landcoversNorthExtraData": "(source: app, (compassData.horizon: -9.199031748908894))",
"landcoversDataSource": "GLOBE Observer App",
"landcoversDryGround": true,
"landcoversSouthClassifications": null,
"landcoversWestCaption": "",
"landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682233/original.jpg",
"landcoversUpwardCaption": "",
"landcoversDownwardExtraData": "(source: app, (compassData.horizon: -88.86569321651771))",
"landcoversEastClassifications": null,
"landcoversMucDetails": "",
"landcoversMeasuredAt": "2020-06-13T15:07:00",
"landcoversDownwardCaption": "",
"landcoversSouthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682236/original.jpg",
"landcoversMuddy": false,
"landcoversWestPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682237/original.jpg",
"landcoversStandingWater": false,
"landcoversLeavesOnTrees": true,
"landcoversUserid": 67150810,
"landcoversSouthExtraData": "(source: app, (compassData.horizon: -11.615041431350335))",
"landcoversSouthCaption": "",
"landcoversRainingSnowing": false,
"landcoversUpwardExtraData": "(source: app, (compassData.horizon: 86.6284079864236))",
"landcoversMeasurementElevation": 24,
"landcoversWestExtraData": "(source: app, (compassData.horizon: -9.251774266832626))",
"landcoversLandCoverId": 32042,
"landcoversMeasurementLongitude": xx.xxx(edited),
"landcoversMucDescription": null,
"landcoversSnowIce": false,
"landcoversNorthClassifications": null,
"landcoversFieldNotes": "(none)"
}
},
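Seeing some of your code would help. That said, as has already been pointed out, the built-in json library will help you here. This output is in JSON format; for an introduction to the format, see here.
Let's assume the output is stored, as a string, in a variable named data. You can convert this JSON data into a dictionary.
Code example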
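import json
# json.loads parses a JSON string; json.load expects an open file object instead
data_dict = json.loads(data)
What json.loads does is take a JSON string and convert it into a Python dictionary (json.load does the same for an open file object). It parses the text and uses a conversion table to map each JSON type to a Python type: a JSON object becomes a dict, an array becomes a list, null becomes None, and so on. See the table here.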
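To make the difference between the two functions concrete, here is a minimal sketch. The raw_text value is a made-up stand-in for the scraped text, and io.StringIO is only used to imitate an open file.

```python
import io
import json

# A tiny stand-in for the scraped text (values are made up for illustration).
raw_text = '{"count": 1, "results": [{"siteId": 202689, "countryName": null, "dry": true}]}'

# json.loads parses a str that is already in memory.
from_string = json.loads(raw_text)

# json.load reads from a file-like object; io.StringIO imitates an open file here.
from_file = json.load(io.StringIO(raw_text))

first = from_string["results"][0]
print(type(from_string).__name__, type(from_string["results"]).__name__)
print(first["countryName"], first["dry"])
```

Both calls give the same dictionary; JSON null and true arrive as Python None and True.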
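Now that you have a Python dictionary, you can access the data in it. So let's go through longitude, latitude, siteId and direction (north, south, etc.). I see an opening "[" with no matching "]". Based on your description, I can only assume the list holds 41 items, so I'll start by getting the first result. You can always loop over the list to get all 41 results.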
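# Top-level fields of the first result (key names are case-sensitive)
longitude = data_dict['results'][0]['longitude']
latitude = data_dict['results'][0]['latitude']
site_id = data_dict['results'][0]['siteId']
# Direction-specific values live one level deeper, under 'data'
north_photo_url = data_dict['results'][0]['data']['landcoversNorthPhotoUrl']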
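Extending this to all 41 results, and to the photos, could look roughly like the sketch below. The function names, the file naming scheme and the photos folder are my own assumptions, not anything the site prescribes. Since both results in your excerpt share the same siteId, the name also includes landcoversLandCoverId to keep files from overwriting each other. The sample is a cut-down record with placeholder coordinates.

```python
import json
import urllib.request
from pathlib import Path

# Direction names as they appear in keys such as "landcoversNorthPhotoUrl".
DIRECTIONS = ["North", "East", "South", "West", "Upward", "Downward"]


def summarize_results(data_dict):
    """Return one flat record per entry in data_dict["results"]."""
    records = []
    for result in data_dict.get("results", []):
        details = result.get("data") or {}
        photos = {}
        for direction in DIRECTIONS:
            url = details.get(f"landcovers{direction}PhotoUrl")
            if url:  # skip missing or null URLs
                photos[direction.lower()] = url
        records.append({
            "site_id": result.get("siteId"),
            "land_cover_id": details.get("landcoversLandCoverId"),
            "latitude": result.get("latitude"),
            "longitude": result.get("longitude"),
            "photos": photos,
        })
    return records


def photo_filename(record, direction):
    """Hypothetical naming scheme: <siteId>_<landCoverId>_<direction>.jpg."""
    return f"{record['site_id']}_{record['land_cover_id']}_{direction}.jpg"


def download_photos(records, folder="photos"):
    """Save every photo under its new name (needs network access)."""
    target = Path(folder)
    target.mkdir(exist_ok=True)
    for record in records:
        for direction, url in record["photos"].items():
            destination = target / photo_filename(record, direction)
            urllib.request.urlretrieve(url, str(destination))


# Cut-down sample in the shape of the scraped text; coordinates are placeholders.
sample = json.loads("""
{"count": 1, "results": [{
    "siteId": 202689, "latitude": 0.0, "longitude": 0.0,
    "data": {
        "landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682241/original.jpg",
        "landcoversWestPhotoUrl": null,
        "landcoversLandCoverId": 32043
    }
}]}
""")

records = summarize_results(sample)
print(records[0]["site_id"], sorted(records[0]["photos"]))
print(photo_filename(records[0], "north"))
# download_photos(records)  # uncomment to actually fetch the files
```

Using .get() instead of square brackets means a record with a missing or null photo URL is skipped rather than raising a KeyError.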
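Tip
- I always use Jupyter notebooks as a quick way to work out how to pull the specific data I want from a JSON object; it can sometimes take a while to reach the right part. That way, by the time I write the variables, I know I'm getting the data I want from the JSON object. JSON objects are sometimes deeply nested and hard to follow.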
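One way to do that kind of exploration in a notebook cell is to print the access path of every value. This is only a sketch: key_paths is a helper name I made up, and the sample is a trimmed record rather than your real data.

```python
import json


def key_paths(node, prefix=""):
    """Yield (path, value) pairs for every leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from key_paths(value, f"{prefix}['{key}']")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from key_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix, node


# Trimmed sample in the shape of the scraped text.
sample = json.loads('{"results": [{"siteId": 202689, "data": {"landcoversMuddy": false}}]}')

paths = list(key_paths(sample, "data_dict"))
for path, value in paths:
    print(path, "=", value)
```

Each printed path can be pasted straight into a cell to fetch that value.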