Flatten nested JSON (dict, list) into a list to prepare for writing into a DB
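I am still working on a problem: flattening a nested JSON file. The nested items are either lists or dicts.
Here is the file I want to flatten (unlike in my previous post, I kept it to a reasonable length; it contains only input[0] and none of the following items, because it would get very long):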
input = [{'states': ['USED'], 'niceName': '1-series', 'id': 'BMW_1_Series',
'years': [{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 100994560},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 100974974},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 100974975},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 100994561}
],
'states': ['USED'], 'id': 100524709, 'year': 2008},
{'styles':
[{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101082656},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101082655},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101082663},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101082662}
],
'states': ['USED'], 'id': 100503222, 'year': 2009},
{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101200599},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101200600},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101200607},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101200601}
],
'states': ['USED'], 'id': 100529091, 'year': 2010},
{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101288165},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101288166},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101288298},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101288297}
],
'states': ['USED'], 'id': 100531309, 'year': 2011},
{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 101381667},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 101381668},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 101381665},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 101381666}
],
'states': ['USED'], 'id': 100534729, 'year': 2012},
{'styles':
[{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i 2dr Coupe (3.0L 6cyl 6M)', 'id': 200428722},
{'trim': '128i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i 2dr Convertible (3.0L 6cyl 6M)', 'id': 200428721},
{'trim': '135is', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135is 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 200421701},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 'id': 200428724},
{'trim': '135i', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 200428723},
{'trim': '128i SULEV', 'states': ['USED'], 'submodel': {'body': 'Coupe', 'niceName': 'coupe', 'modelName': '1 Series Coupe'},
'name': '128i SULEV 2dr Coupe (3.0L 6cyl 6M)', 'id': 200428726},
{'trim': '128i SULEV', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '128i SULEV 2dr Convertible (3.0L 6cyl 6M)', 'id': 200428725},
{'trim': '135is', 'states': ['USED'], 'submodel': {'body': 'Convertible', 'niceName': 'convertible', 'modelName': '1 Series Convertible'},
'name': '135is 2dr Convertible (3.0L 6cyl Turbo 6M)', 'id': 200428727}
],
'states': ['USED'], 'id': 200421700, 'year': 2013}
],
'name': '1 Series', 'make': {'niceName': 'bmw', 'name': 'BMW', 'id': 200000081}
}, #here is more to come, but I needed to crop it
]
After my own approach failed, the code I have used so far was written by @poke, from: Flattening Generic JSON List of Dicts or Lists in Python
def splitObj (obj, prefix = None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    '''
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }
    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            del new[k]
            return new, k, v
        # or just one subobject
        elif isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None

def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)
        # just return fully flat objects
        if key is None:
            yield flat
            continue
        # otherwise recursively flatten the subobjects
        for sub in flatten(subs, key):
            sub.update(flat)
            yield sub
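I get the following error:
AttributeError: 'str' object has no attribute 'items'
It is caused by 'states': ['USED'].
I don't know how to handle this. The value under the key 'states' could simply be stored as a list.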
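To see where the error comes from, here is a minimal sketch. It is not @poke's full code: prefix_keys is a hypothetical helper that copies only the key-prefixing step of splitObj, and record is a shortened stand-in for the real data. When flatten recurses into the list under 'states', each element is the plain string 'USED' rather than a dict, so calling .items() on it fails:

```python
def prefix_keys(obj, prefix):
    # Same dict comprehension that splitObj runs when a prefix is given.
    return {'{}_{}'.format(prefix, k): v for k, v in obj.items()}

record = {'states': ['USED'], 'id': 'BMW_1_Series'}

errors = []
for element in record['states']:
    # element is the str 'USED', not a dict, so it has no .items() method.
    try:
        prefix_keys(element, 'states')
    except AttributeError as exc:
        errors.append(str(exc))

print(errors)
```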
I hope someone can help me solve this.
PS: This is a follow-up post to Python: Write Nested JSON as multiple elements in List
This is my solution for splitObj:
def splitObj (obj, prefix = None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    obj needs to be a dictionary
    '''
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None or prefix=="NotFlat" else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }
    cL = 0
    cD = 0
    # count how many values are lists and how many are dicts
    for k, v in new.items():
        # determine the number of lists in the values
        if isinstance(v, list):
            cL += 1
        # determine the number of dicts in the values
        elif isinstance(v, dict):
            cD += 1
    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            if (cD+cL) <=1:
                # treat an empty list like a list holding one empty string
                try:
                    type(v[0])
                except IndexError:
                    v = [""]
                if not isinstance(v[0], str):
                    del new[k]
                    return new, k, v
                elif isinstance(v[0], str):
                    # list containing only strings: join it and return the whole thing
                    new[k] = ", ".join(v)
                    return new, None, None
                else:
                    # custLog is the author's own logging module (not shown)
                    custLog.logger.info("")
            elif (cD+cL) >1:
                #print("Count List2 CD: "+str(cD))
                #print("Count LIST2 CL: "+str(cL))
                # treat an empty list like a list holding one empty string
                try:
                    type(v[0])
                except IndexError:
                    v = [""]
                if not isinstance(v[0], str):
                    del new[k]
                    for x in flatten([new]):
                        newOut = x
                        break
                    return newOut, k, v
                elif isinstance(v[0], str):
                    # list containing only strings: join it, then
                    # handle the other dicts/lists that might be in the same object;
                    # use "NotFlat" to run the loop again without adding a prefix
                    new[k] = ", ".join(v)
                    return None, "NotFlat", [new]
                else:
                    custLog.logger.error("weder noch 2")
        # or just one subobject
        elif isinstance(v, dict):
            if (cD+cL) <=1:
                del new[k]
                return new, k, [v]
            elif (cD+cL) >1:
                del new[k]
                for x in flatten([new]):
                    newOut = x
                    break
                return newOut, k, [v]
    return new, None, None
And here is flatten:
def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)
        if subs is None:
            if key is None:
                yield flat
                continue
        # just return fully flat objects
        if key is None and flat is not None:
            yield flat
            continue
        # otherwise recursively flatten the subobjects
        try:
            for sub in flatten(subs, key):
                if flat is not None:
                    sub.update(flat)
                yield sub
        except TypeError as e:
            custLog.logger.error("ERR: TypeError"+str(e))
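The idea of joining string-only lists, which the splitObj above does with ", ".join(v), can also be isolated into a separate preprocessing step. The following is only a sketch under that assumption: join_string_lists is a hypothetical helper, not part of the solution above, and sample is made-up data in the shape of the question's input.

```python
def join_string_lists(obj, sep=", "):
    """Return a copy of obj where every non-empty list of strings becomes one joined string."""
    if isinstance(obj, dict):
        return {k: join_string_lists(v, sep) for k, v in obj.items()}
    if isinstance(obj, list):
        # A list holding only strings (e.g. ['USED']) is collapsed into a single string.
        if obj and all(isinstance(x, str) for x in obj):
            return sep.join(obj)
        # Any other list is kept as a list and processed element by element.
        return [join_string_lists(x, sep) for x in obj]
    return obj

sample = {'states': ['USED', 'NEW'], 'years': [{'states': ['USED'], 'year': 2008}]}
cleaned = join_string_lists(sample)
print(cleaned)
```

After such a step, the only lists left are lists of dicts, which is the case the original splitObj from @poke expects.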
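Although it is not a generic function, consider iterating through every nested element to get flat output for a database import or a flat-file (csv, txt) export. Since the JSON file consists of a combination of dicts and lists, handle each of them accordingly at every level: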
# data is the parsed JSON list (called input in the question)
items = []
for outer in data:
    inner = [''] * 15
    for outerk, outerv in outer.items():
        inner[0] = outer['states'][0]
        inner[1] = outer['niceName']
        inner[2] = outer['id']
        inner[3] = outer['make']['niceName']
        inner[4] = outer['make']['name']
        inner[5] = outer['make']['id']
        if outerk == 'years':
            for yri in outer[outerk]:
                for yrk, yrv in yri.items():
                    inner[6] = yri['states'][0]
                    inner[7] = yri['id']
                    inner[8] = yri['year']
                    if yrk == 'styles':
                        for stylei in yri[yrk]:
                            inner[9] = stylei['trim']
                            inner[10] = stylei['name']
                            inner[11] = stylei['id']
                            inner[12] = stylei['submodel']['body']
                            inner[13] = stylei['submodel']['niceName']
                            inner[14] = stylei['submodel']['modelName']
                            # copy all 15 fields; a slice of [0:14] would drop modelName
                            items.append(inner[:])
for i in items:
    print(i)
Output (parent items are repeated for every child item)
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Convertible (3.0L 6cyl 6M)', 100994560, 'Convertible', 'convertible', '1 Series Convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '128i', '128i 2dr Coupe (3.0L 6cyl 6M)', 100974974, 'Coupe', 'coupe', '1 Series Coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 100974975, 'Coupe', 'coupe', '1 Series Coupe']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100524709, 2008, '135i', '135i 2dr Convertible (3.0L 6cyl Turbo 6M)', 100994561, 'Convertible', 'convertible', '1 Series Convertible']
# ['USED', '1-series', 'BMW_1_Series', 'bmw', 'BMW', 200000081, 'USED', 100503222, 2009, '135i', '135i 2dr Coupe (3.0L 6cyl Turbo 6M)', 101082656, 'Coupe', 'coupe', '1 Series Coupe']
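Once the rows are flat, writing them to a database is a single bulk insert. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the table name styles, its column names, and the shortened four-column rows are assumptions, not part of the answer above.

```python
import sqlite3

# Two rows in the spirit of the printed output above, shortened to four
# columns for brevity (the column choice is illustrative only).
rows = [
    ('BMW_1_Series', 2008, '128i', 100994560),
    ('BMW_1_Series', 2008, '135i', 100974975),
]

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE styles (model_id TEXT, year INTEGER, trim_name TEXT, style_id INTEGER)'
)
# executemany binds each tuple to the ? placeholders, one INSERT per row.
conn.executemany('INSERT INTO styles VALUES (?, ?, ?, ?)', rows)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM styles').fetchone()[0]
print(count)
conn.close()
```

With another database driver the placeholder style may differ (psycopg2, for example, uses %s), so the INSERT statement would need adjusting.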
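Rethinking the problem
It is often easier to find a solution to a more general problem. So let us first take a closer look at the problem.
The input is a JSON file describing a set of objects.
An object is defined recursively as either an atom (a string or a number) or a dictionary whose values are objects. Lists are used to express alternatives (i.e. any element of the list can take the place of the list).
For example, {a:[1,2]} means that a can be either 1 or 2.
The output should be a list of objects that contain no alternatives. In addition, the objects should be flattened, i.e. they should be dictionaries whose values are atoms and whose keys describe the path of the value in the original object.
My solution handles alternatives and flattening separately.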
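Normalisation
The function normalise below takes the parsed JSON data (as returned by json.loads) and produces a sequence of dictionaries. Note that the input and the output of normalise have the same semantics and describe the same set of objects. The output is merely normalised in the sense that it contains alternatives only at the top level. Database people would call this denormalised, because it is not suitable for a relational database.
normalise always returns a sequence of objects. It is implemented as a generator to keep memory usage low.
normalise distinguishes the following cases.
- An atomic input means that there is only one possibility. Hence the atom is yielded (this is like returning a list containing the atom).
- A list means alternatives of alternatives. It yields all elements of its normalised inputs (this is like concatenating lists).
- A dictionary means that we have to consider all combinations of alternatives for the individual keys. Hence we return the Cartesian product of the alternatives.
The code is as follows: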
import itertools

def normalise(x):
    if isinstance(x, dict):
        keys = x.keys()
        values = (normalise(i) for i in x.values())
        for i in itertools.product(*values):
            yield (dict(zip(keys, i)))
    elif isinstance(x, list):
        #if not x: # uncomment for "LEFT JOIN" behaviour
        #    yield None
        for i in x:
            yield from normalise(i)
    else:
        yield x
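If the input contains an empty list, this code yields no objects for the enclosing dictionary, because there is no possible value for that position. This is like an SQL "INNER JOIN". Judging from Bert's answer, he seems to want "LEFT JOIN" behaviour (i.e. some default value). To achieve this, just uncomment the two lines.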
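The difference between the two behaviours can be seen on a small input. This is a sketch that repeats normalise so it runs on its own; the left_join parameter is a hypothetical switch standing in for the two commented-out lines, and doc is made-up data.

```python
import itertools

def normalise(x, left_join=False):
    """Same logic as normalise above; left_join replaces the commented-out lines."""
    if isinstance(x, dict):
        keys = list(x.keys())
        values = [normalise(v, left_join) for v in x.values()]
        # One output dict per combination of alternatives across all keys.
        for combo in itertools.product(*values):
            yield dict(zip(keys, combo))
    elif isinstance(x, list):
        if not x and left_join:
            # "LEFT JOIN": an empty list contributes a single None default.
            yield None
        for item in x:
            yield from normalise(item, left_join)
    else:
        yield x

doc = {'id': 1, 'years': [{'year': 2008}, {'year': 2009}], 'tags': []}

inner = list(normalise(doc))                  # empty 'tags' wipes out every combination
left = list(normalise(doc, left_join=True))   # 'tags' becomes None instead
print(inner)
print(left)
```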
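Pseudo-flattening
The objects produced by normalise still have the original (nested) dictionary structure. They can be flattened using the code from the other discussions.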
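For completeness, a real flattening step for such list-free nested dictionaries could look roughly like the sketch below. flatten_paths is a hypothetical helper, not code from the other discussions, and it assumes that all keys are strings (as they are in JSON) and that no lists remain.

```python
def flatten_paths(obj, sep='.', prefix=''):
    """Flatten a nested dict without lists into {path: atom}."""
    flat = {}
    for key, value in obj.items():
        # Build the path of this value, e.g. 'years' + '.' + 'year'.
        path = prefix + sep + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_paths(value, sep, path))
        else:
            flat[path] = value
    return flat

nested = {'id': 'BMW_1_Series', 'years': {'year': 2008, 'styles': {'trim': '128i'}}}
flat = flatten_paths(nested)
print(flat)
```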
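However, the OP wants to insert the objects into a database. So he most likely does not need the list of keys of a flattened dictionary; all he needs is a function returning the value for a given path.
This can be achieved by creating a wrapper object for the dictionary with a __getitem__ method. The wrapper can also be used to return a default value for paths that do not exist.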
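class DictWrapper:
    def __init__(self, d, sep='.'):
        self.d = d
        self.sep = sep

    def __getitem__(self, key):
        ret = self.d
        try:
            # walk down the nested dicts, one path component at a time
            for k in key.split(self.sep):
                ret = ret[k]
            return ret
        except (KeyError, TypeError):
            # missing key, or the path runs through an atom or None
            return None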
The SQL insert can then look like this (tested with psycopg2):
# cur is an open psycopg2 cursor
for i in normalise(input):
    cur.execute('insert into mytable (year) VALUES (%(years.year)s)', DictWrapper(i))
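The path lookup can be tried without a database. The sketch below repeats DictWrapper so that it runs on its own; catching TypeError in addition to KeyError is a choice made in this sketch so that a path running through an atom also yields None. The row dictionary is made-up data, and Python's %-formatting is used only as a stand-in for the named-placeholder lookup.

```python
class DictWrapper:
    def __init__(self, d, sep='.'):
        self.d = d
        self.sep = sep

    def __getitem__(self, key):
        ret = self.d
        try:
            # Follow the path one component at a time through the nested dicts.
            for k in key.split(self.sep):
                ret = ret[k]
            return ret
        except (KeyError, TypeError):
            return None

row = {'id': 'BMW_1_Series', 'years': {'year': 2008, 'styles': {'trim': '128i'}}}
wrapped = DictWrapper(row)

year = wrapped['years.year']      # existing path
missing = wrapped['make.name']    # non-existing path falls back to None
# %-formatting looks up each named placeholder through __getitem__.
text = 'year=%(years.year)s trim=%(years.styles.trim)s' % wrapped
print(year, missing, text)
```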
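Implementation details
- For the sake of clarity, this implementation obviously sacrifices some runtime performance.
- Abstract base classes could be used instead of list and dict. However, this can be problematic, because str is a sequence but should be treated as an atom.
- DictWrapper only works correctly if sep is not contained in any dictionary key.
- normalise does not filter out duplicates. This could be done by using sets and named tuples instead of lists and dicts. However, this would mean that the whole result has to be held in memory. It is probably better to filter out duplicates at the database level.
- To keep memory usage to a minimum, the JSON should be read lazily.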
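As an illustration of the point about abstract base classes, here is one possible variant. It is a sketch only: normalise_abc is a hypothetical name, and the explicit exclusion of str, bytes and bytearray is one way to keep those sequence types atomic.

```python
import itertools
from collections.abc import Mapping, Sequence

def normalise_abc(x):
    """Variant of normalise accepting any Mapping or Sequence, with strings kept atomic."""
    if isinstance(x, Mapping):
        keys = list(x.keys())
        values = [normalise_abc(v) for v in x.values()]
        for combo in itertools.product(*values):
            yield dict(zip(keys, combo))
    elif isinstance(x, Sequence) and not isinstance(x, (str, bytes, bytearray)):
        # Tuples and other sequences are treated as alternatives, like lists.
        for item in x:
            yield from normalise_abc(item)
    else:
        # str is a Sequence too, but it must be yielded whole.
        yield x

result = list(normalise_abc({'trim': ('128i', '135i'), 'state': 'USED'}))
print(result)
```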