Missing first document when loading multi-document yaml file in pandas dataframe
I am trying to load a multi-document YAML file (i.e. a YAML file consisting of multiple YAML documents separated by "---") into a Pandas DataFrame. For some reason the first document does not end up in the DataFrame. If the output of yaml.safe_load_all is first materialized into a list (instead of feeding the iterator to pd.io.json.json_normalize), all documents end up in the DataFrame. I can reproduce this with the example code below (on an entirely different YAML file).
import os
import yaml
import pandas as pd
import urllib.request
# public example of multi-document yaml
inputfilepath = os.path.expanduser("~/my_example.yaml")
url = "https://raw.githubusercontent.com/kubernetes/examples/master/guestbook/all-in-one/guestbook-all-in-one.yaml"
urllib.request.urlretrieve(url, inputfilepath)
with open(inputfilepath, 'r') as stream:
    df1 = pd.io.json.json_normalize(yaml.safe_load_all(stream))

with open(inputfilepath, 'r') as stream:
    df2 = pd.io.json.json_normalize([x for x in yaml.safe_load_all(stream)])
print(f'Output table shape with iterator: {df1.shape}')
print(f'Output table shape with iterator materialized as list: {df2.shape}')
I would expect both results to be identical, but instead I get:
Output table shape with iterator: (5, 18)
Output table shape with iterator materialized as list: (6, 18)
Any idea why these results differ?
See this site on list comprehensions vs. generator expressions.
df1 is missing the first row of data because you passed an iterator rather than an iterable (a list):
print(yaml.safe_load_all(stream))
#Output: <generator object load_all at 0x00000293E1697750>
From the pandas docs, a list is expected:
data : dict or list of dicts
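Given that requirement, the simplest workaround is to materialize the generator before handing it to json_normalize. A minimal sketch, reusing inputfilepath and the imports from the question:
with open(inputfilepath, 'r') as stream:
    # list() drains the generator up front, so json_normalize
    # receives every document, including the first one
    docs = list(yaml.safe_load_all(stream))

df = pd.io.json.json_normalize(docs)
print(df.shape)  # (6, 18) for the guestbook example above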
Update with more details:
Looking at the normalize.py source file, the json_normalize function performs a conditional check, whereby your generator is treated as if you had passed in a nested structure:
if any([isinstance(x, dict)
        for x in compat.itervalues(y)] for y in data):
    # naive normalization, this is idempotent for flat records
    # and potentially will inflate the data considerably for
    # deeply nested structures:
    #  {VeryLong: { b: 1,c:2}} -> {VeryLong.b:1 ,VeryLong.c:@}
    #
    # TODO: handle record value which are lists, at least error
    #       reasonably
    data = nested_to_record(data, sep=sep)
return DataFrame(data)
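That any(...) call is where the first element disappears: when data is a generator, any() pulls records one at a time and short-circuits as soon as the inner list comprehension is truthy, i.e. right after the very first record. A standalone sketch of that behavior, using made-up records rather than the question's YAML:
docs = iter([
    {'kind': 'Service', 'spec': {'port': 80}},        # consumed by any()
    {'kind': 'Deployment', 'spec': {'replicas': 3}},
    {'kind': 'Service', 'spec': {'port': 6379}},
])

# mirrors the pandas check (compat.itervalues is just .values() on Python 3)
has_nested = any([isinstance(x, dict) for x in y.values()] for y in docs)

print(has_nested)  # True
print(list(docs))  # only the last two records are left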
Inside the nested_to_record function:
new_d = copy.deepcopy(d)
for k, v in d.items():
    # each key gets renamed with prefix
    if not isinstance(k, compat.string_types):
        k = str(k)
    if level == 0:
        newkey = k
    else:
        newkey = prefix + sep + k

    # only dicts gets recurse-flattend
    # only at level>1 do we rename the rest of the keys
    if not isinstance(v, dict):
        if level != 0:  # so we skip copying for top level, common case
            v = new_d.pop(k)
            new_d[newkey] = v
        continue
    else:
        v = new_d.pop(k)
        new_d.update(nested_to_record(v, newkey, sep, level + 1))
new_ds.append(new_d)
The loop that calls d.items() here only ever sees what is left of your generator: the any(...) check shown earlier already consumed the first record while deciding whether the data was nested, and that record, the first YAML document in your case, is the one that goes missing.
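For completeness, the recursive flattening above is what turns the nested Kubernetes manifests into the 18 dot-separated columns seen in the question. A small illustration with a hypothetical record:
import pandas as pd

record = {'kind': 'Service',
          'metadata': {'name': 'redis-master', 'labels': {'app': 'redis'}}}

# nested dict keys become prefix-joined column names
print(pd.io.json.json_normalize(record).columns.tolist())
# ['kind', 'metadata.name', 'metadata.labels.app']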