Simpler way to traverse a complex dictionary in Python?

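I am iterating over a complex JSON object that has been loaded into Python as a dictionary. Below is a sample of the JSON file. The data of interest is marked with comments.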

{  
   "name":"ns1:timeSeriesResponseType",
   "nil":false,
   "value":{  
      "queryInfo":{  },
      "timeSeries":[  
         {  
            "variable":{  },
            "values":[  
               {  
                  "qualifier":[  ],
                  "censorCode":[  ],
                  "value":[  
                     {  
                        "codedVocabularyTerm":null,
                        "censorCode":null,
                        "offsetTypeID":null,
                        "accuracyStdDev":null,
                        "timeOffset":null,
                        "qualifiers":[  
                           "P",                      # data of interest
                           "Ice"                     # data of interest
                        ],
                        "qualityControlLevelCode":null,
                        "sampleID":null,
                        "dateTimeAccuracyCd":null,
                        "methodCode":null,
                        "codedVocabulary":null,
                        "sourceID":null,
                        "oid":null,
                        "dateTimeUTC":null,
                        "offsetValue":null,
                        "metadataTime":null,
                        "labSampleCode":null,
                        "methodID":null,
                        "value":"-999999",
                        "dateTime":"2015-02-24T03:30:00.000-05:00",
                        "offsetTypeCode":null,
                        "sourceCode":null
                     },
                     {  
                        "codedVocabularyTerm":null,
                        "censorCode":null,
                        "offsetTypeID":null,
                        "accuracyStdDev":null,
                        "timeOffset":null,
                        "qualifiers":[  ],
                        "qualityControlLevelCode":null,
                        "sampleID":null,
                        "dateTimeAccuracyCd":null,
                        "methodCode":null,
                        "codedVocabulary":null,
                        "sourceID":null,
                        "oid":null,
                        "dateTimeUTC":null,
                        "offsetValue":null,
                        "metadataTime":null,
                        "labSampleCode":null,
                        "methodID":null,
                        "value":"-999999",                          # data of interest
                        "dateTime":"2015-02-24T04:00:00.000-05:00", # data of interest
                        "offsetTypeCode":null,
                        "sourceCode":null
                     }
                  ],
                  "sample":[  ],
                  "source":[  ],
                  "offset":[  ],
                  "units":null,
                  "qualityControlLevel":[  ],
                  "method":[  ]
               }
            ],
            "sourceInfo":{  },
            "name":"USGS:03193000:00060:00011"
         },
         {  },  # more of the needed data is stored in here
         {  },  # more of the needed data is stored in here
         {  }   # more of the needed data is stored in here
      ]
   },
   "declaredType":"org.cuahsi.waterml.TimeSeriesResponseType",
   "scope":"javax.xml.bind.JAXBElement$GlobalScope",
   "globalScope":true,
   "typeSubstituted":false
}

Here is my code for stepping through the dictionary to get the data I want and store it in a dictionary with a much simpler format:

# Setting up blank variables to store results
outputDict = {}
outputList = []
dateTimeList = []
valueList = []
qualifiersList = [[]]


for key in result["value"]["timeSeries"]:
    for key2 in key:
        if key2 == "values":
            for key3 in key.get(key2):
                for key4 in key3:
                    if key4 == "value":
                        for key5 in key3.get(key4):
                            for key6 in key5:
                                if key6 == "value":
                                    valueList.append(key5.get(key6))
                                if key6 == "dateTime":
                                    dateTimeList.append(key5.get(key6))
                        #print key.get("name")
                        #outputDict[key.get("name")]["dateTime"] = dateTimeList
                        #outputDict[key.get("name")]["values"] = valueList

        if key2 == "name":
            outputList.append(key.get(key2))
            outputDict[key.get(key2)]={"dateTime":None, "values":None, "qualifiers":None}
            outputDict[key.get("name")]["dateTime"] = dateTimeList
            outputDict[key.get("name")]["values"] = valueList
            del dateTimeList[:]
            del valueList[:]

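My question is: as a newcomer to Python, can anyone point out any obvious inefficiencies in my code? I can count on the JSON file not changing structure for months (possibly years), so I believe my initial use of for key in result["value"]["timeSeries"]: is fine, but I'm not sure whether the many, many for loops are unnecessary or inefficient. Is there a simple way to search for and return key:value pairs from such a layered dictionary, where lists of dictionaries are nested inside lists of dictionaries?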

EDIT:

Based on the solution provided by @Alex Martelli, here is the new, more efficient, slimmed-down version of the code:

# Building the output dictionary
for key in result["value"]["timeSeries"]:
    if "values" in key:
        for key2 in key.get("values"):
            if "value" in key2:
                for key3 in key2.get("value"):
                    if "value" in key3:
                        valueList.append(key3.get("value"))
                    if "dateTime" in key3:
                        dateTimeList.append(key3.get("dateTime"))
                    if "qualifiers" in key3:
                        qualifiersList.append(key3.get("qualifiers"))

    if "name" in key:
        outputList.append(key.get("name"))
        outputDict[key.get("name")]={"dateTime":None, "values":None, "qualifiers":None}
        outputDict[key.get("name")]["dateTime"] = dateTimeList[:]      # storing a copy of each list rather
        outputDict[key.get("name")]["values"] = valueList[:]           # than a reference to it, so the
        outputDict[key.get("name")]["qualifiers"] = qualifiersList[:]  # deletes below do not empty the stored data
        del dateTimeList[:]
        del valueList[:]
        del qualifiersList[:]

Same effect, with 4 fewer lines of code. Faster run time. Nice.

EDIT:

Based on the solution proposed by @Two-Bit Alchemist, this works as well:

# Building the output dictionary
for key in result["value"]["timeSeries"]:
    print key
    for value in key["values"][0]["value"]:
        # qualifiers is a list containing ["P", "Ice"]
        qualifiersList.append(value['qualifiers'])
        valueList.append(value['value'])
        dateTimeList.append(value['dateTime'])


    if "name" in key:
        outputList.append(key.get("name"))
        outputDict[key.get("name")]={"dateTime":None, "values":None, "qualifiers":None}
        outputDict[key.get("name")]["dateTime"] = dateTimeList[:]      # storing a copy of each list rather
        outputDict[key.get("name")]["values"] = valueList[:]           # than a reference to it, so the
        outputDict[key.get("name")]["qualifiers"] = qualifiersList[:]  # deletes below do not empty the stored data
        del dateTimeList[:]
        del valueList[:]
        del qualifiersList[:]

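The only problem I see is that I can never be completely sure that the first position in the ["values"] list is the one I want. I also lose the checks that the "if" statements provided, which were meant to ensure that no error is introduced if the values returned come from a bad query.

EDIT: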

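import sys
import time
from datetime import datetime

import requests

# start time, used for the runtime report further down
now = time.time()

try:

    # requests.get returns a Response object
    # in this case its body is JSON because of the settings in the query
    # (query is the request URL, defined elsewhere)
    response = requests.get(url=query)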


    # if-else ladder that only performs the parsing of the returned JSON object
    # when the HTTP status code indicates a successful query execution
    if(response.status_code == 200):

        # parsing the JSON body into a dictionary
        result = response.json()

        # Setting up blank variables to store results
        outputDict = {}
        outputList = []
        dateTimeList = []
        valueList = []
        qualifiersList = []


        # Building the output dictionary
        for key in result["value"]["timeSeries"]:
            print key
            for value in key["values"][0]["value"]:
                # qualifiers is a list containing ["P", "Ice"]
                qualifiersList.append(value['qualifiers'])
                valueList.append(value['value'])
                dateTimeList.append(value['dateTime'])

            # OLD CODE   
            # if "values" in key:
            #     for key2 in key.get("values"):
            #         if "value" in key2:
            #             for key3 in key2.get("value"):
            #                 if "value" in key3:
            #                     valueList.append(key3.get("value"))
            #                 if "dateTime" in key3:
            #                     dateTimeList.append(key3.get("dateTime"))
            #                 if "qualifiers" in key3:
            #                     qualifiersList.append(key3.get("qualifiers"))

            if "name" in key:
                outputList.append(key.get("name"))
                outputDict[key.get("name")]={"dateTime":None, "values":None, "qualifiers":None}
                outputDict[key.get("name")]["dateTime"] = dateTimeList[:]      # storing a copy of each list rather
                outputDict[key.get("name")]["values"] = valueList[:]           # than a reference to it, so the
                outputDict[key.get("name")]["qualifiers"] = qualifiersList[:]  # deletes below do not empty the stored data
                del dateTimeList[:]
                del valueList[:]
                del qualifiersList[:]


        # Tracking how long it took to process the data
        elapsed = time.time() - now
        print "Runtime: " + str(elapsed)

        out = {"Status": 'ok', "Results": [[{"myResult": outputDict}]]}

    elif(response.status_code == 400):
        raise Exception("Bad Request, "+ datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    elif(response.status_code== 403):
        raise Exception("Access Forbidden, "+ datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    elif(response.status_code == 404):
        raise Exception("Gage location(s) not Found, "+ datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    elif(response.status_code == 500):
        raise Exception("Internal Server Error, "+ datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    elif(response.status_code == 503):
        raise Exception("Service Unavailable, "+ datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    else:
        raise Exception("Unknown Response, "+ datetime.now().strftime('%Y-%m-%d %H:%M:%S'))



except:
    out = {"Status": 'Error', "Message": str(sys.exc_info()[1])}


print out

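You ask about "any obvious inefficiencies in my code". The answer is yes, particularly where you loop over dictionaries (thus fetching all of their keys in sequence, which is O(N), i.e., it takes time proportional to the number of keys in the dictionary) instead of just using them as dictionaries (which takes O(1), i.e., constant time, and is fast too).

So, for example, where you have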

for key2 in key:
    if key2 == "values":
       ...use key.get(key2)...
    if key2 == "name":
       ...use key.get(key2)...

you should instead have:

if 'values' in key:
   ...use key['values']...
if 'name' in key:
   ...use key['name']...

and similarly for the constructs at deeper nesting levels. Things can be optimized further, e.g.:

values = key.get('values')
if values is not None:
    ...use values...
name = key.get('name')
if name is not None:
    ...use name...

to avoid repeated indexing (again, similarly at the deeper nesting levels).
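
The pattern above can be applied at every nesting level. What follows is only a minimal sketch in Python 3, not the asker's actual code: the sample dictionary is a hypothetical, trimmed-down stand-in for the JSON in the question, and the function name extract_series is made up for illustration. One deliberate difference from the question's code: a missing key is recorded as None so that the three lists stay aligned.

```python
# Hypothetical, trimmed-down sample shaped like the JSON in the question.
result = {
    "value": {
        "timeSeries": [
            {
                "name": "USGS:03193000:00060:00011",
                "values": [
                    {
                        "value": [
                            {"value": "-999999",
                             "dateTime": "2015-02-24T03:30:00.000-05:00",
                             "qualifiers": ["P", "Ice"]},
                            {"value": "-999999",
                             "dateTime": "2015-02-24T04:00:00.000-05:00",
                             "qualifiers": []},
                        ]
                    }
                ],
            },
            {},  # an entry without a "name" is skipped
        ]
    }
}


def extract_series(result):
    """Collect dateTime/values/qualifiers lists for each named time series."""
    output = {}
    for series in result["value"]["timeSeries"]:
        name = series.get("name")      # one lookup, no loop over the keys
        if name is None:
            continue
        entry = {"dateTime": [], "values": [], "qualifiers": []}
        # "or []" treats a missing or null list as empty
        for value_set in series.get("values") or []:
            for reading in value_set.get("value") or []:
                # a missing key is stored as None to keep the lists aligned
                entry["dateTime"].append(reading.get("dateTime"))
                entry["values"].append(reading.get("value"))
                entry["qualifiers"].append(reading.get("qualifiers"))
        output[name] = entry           # a fresh dict per series, so no copying or del needed
    return output


output_dict = extract_series(result)
```

Because each series gets its own fresh lists, the copy-then-delete step from the question is no longer needed.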

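Based on my understanding of the question, I believe your original, convoluted approach was massive overkill and has obscured a very simple solution. Please correct me if I am still misunderstanding and oversimplifying. Even though the structure is very complex, if the variable part of it is the length of the list at timeSeries, you can access that list and iterate over it, grabbing your "data of interest" on each pass. I don't know what this data is, so I can't give you a good example data structure or even decent variable names for how it should be stored for later use in your program, so I am just storing it in one big list of lists to show you what I mean: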

data_of_interest = []
for data in json_structure['value']['timeSeries']:
    value_list = data['values'][0]['value']
    # qualifiers is a list containing ["P", "Ice"]
    qualifiers = value_list[0]['qualifiers']
    value = value_list[1]['value']
    dateTime = value_list[1]['dateTime']
    data_of_interest.append([qualifiers, value, dateTime])

If there is repetition at any of the other places where I hard-coded index 0, simply introduce for loops there, e.g.

for data in json_structure['value']['timeSeries']:
    for value_set in data['values']:
        for value_list in value_set['value']:
            # etc
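
To make the fully looped version concrete, here is a small runnable sketch in Python 3. The sample data, the series names "A" and "B", and the choice to store the series name in each row are assumptions made for illustration only; the real structure has many more fields.

```python
# Hypothetical sample: series "A" has two value sets, series "B" has one.
json_structure = {
    "value": {
        "timeSeries": [
            {
                "name": "A",
                "values": [
                    {"value": [
                        {"qualifiers": ["P", "Ice"], "value": "-999999",
                         "dateTime": "2015-02-24T03:30:00.000-05:00"},
                        {"qualifiers": [], "value": "-999999",
                         "dateTime": "2015-02-24T04:00:00.000-05:00"},
                    ]},
                    {"value": [
                        {"qualifiers": ["P"], "value": "12.5",
                         "dateTime": "2015-02-24T04:30:00.000-05:00"},
                    ]},
                ],
            },
            {
                "name": "B",
                "values": [
                    {"value": [
                        {"qualifiers": [], "value": "3.1",
                         "dateTime": "2015-02-24T03:30:00.000-05:00"},
                    ]},
                ],
            },
        ]
    }
}

data_of_interest = []
for data in json_structure['value']['timeSeries']:
    for value_set in data['values']:          # no hard-coded index 0
        for reading in value_set['value']:    # every reading, not just the first two
            data_of_interest.append(
                [data['name'], reading['qualifiers'],
                 reading['value'], reading['dateTime']]
            )

for row in data_of_interest:
    print(row)
```

This yields one row per reading, so the result no longer depends on which position in the "values" list holds the data.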

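If you are worried that some of the values might not be there, be prepared to catch a KeyError from the dict, or the equivalent.

For example, where I wrote: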

value = value_list[1]['value']

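this may raise an IndexError if value_list does not have >= 2 values, or a KeyError if its second element is a dictionary that does not map 'value' to something. You can catch either or both of these and handle them together or separately, or ignore them and move on.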

try:
    value = value_list[1]['value']
except KeyError:    # catch only one
    # do something
    pass

try:
    value = value_list[1]['value']
except (IndexError, KeyError):    # catch both
    # handle together
    pass

try:
    value = value_list[1]['value']
except IndexError:
    # handle IndexError
    pass
except KeyError:
    # handle KeyError
    pass

Your # handle whatever code may well be just pass, which means "I know this might happen, but don't freak out. Just keep reading." If you don't catch them, the exceptions will "bubble up" to the top of the execution context and crash your program.
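
As an illustration of the "catch both" variant, here is a short Python 3 sketch. The helper name second_value and its default parameter are hypothetical and not part of the original answer; the helper simply wraps the lookup so that a missing element or a missing key falls back to a default.

```python
def second_value(value_list, default=None):
    """Return value_list[1]['value'], or default if that lookup fails."""
    try:
        return value_list[1]['value']
    except (IndexError, KeyError):
        # IndexError: fewer than two elements
        # KeyError: the second element is a dict without a 'value' key
        return default


complete = second_value([{"value": "1"}, {"value": "2"}])
too_short = second_value([{"value": "1"}])
no_key = second_value([{"value": "1"}, {"dateTime": "x"}], default="n/a")

print(complete, too_short, no_key)
```

Note that other problems, such as a second element that is None rather than a dictionary, raise a TypeError, which this sketch deliberately does not catch, so unexpected data still surfaces as an error.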