Converting nested list to pandas dataframe with column names
I have a nested list that looks like this:
features = [['0:0.084556', '1:0.138594', '2:0.094304\n'],
            ['0:0.101468', '4:0.138594', '5:0.377215\n'],
            ['0:0.135290', '2:0.277187', '3:0.141456\n']]
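Each list inside the nested list is one row with comma-separated entries. The part to the left of ":" is the column name and the part to the right is the value for that row.
I want to convert it to a pandas dataframe like this: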
f_0000 | f_0001 | f_0002 | f_0003 | f_0004 | f_0005
---------------------------------------------------------------
0.084556 | 0.138594 | 0.094304 | 0.000000 | 0.000000 | 0.000000
0.101468 | 0.000000 | 0.000000 | 0.000000 | 0.138594 | 0.377215
0.135290 | 0.000000 | 0.277187 | 0.141456 | 0.000000 | 0.000000
Can someone help me with this?
Original DF (but pd.read_clipboard doesn't format it correctly for me..)
ex_id labels features
0 0 446,521,1149,1249,1265,1482 0:0.084556 1:0.138594 2:0.094304 3:0.195764 4:...
1 1 78,80,85,86 0:0.050734 1:0.762265 2:0.754431 3:0.065255 4:...
2 2 457,577,579,640,939,1158 0:0.101468 1:0.138594 2:0.377215 3:0.130509 4:...
3 3 172,654,693,1704 0:0.186024 1:0.346484 2:0.141456 3:0.195764 4:...
4 4 403,508,1017,1052,1731,3183 0:0.135290 1:0.277187 2:0.141456 3:0.065255 4:...
Try this:
df = pd.DataFrame(data, columns=['Column name 1', 'column name 2'])
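For what it's worth, `columns` takes a single list with one label per column, and the data passed in must already be rectangular, so the raw "key:value" strings would have to be parsed first. A minimal sketch, where `rows` and `df_small` are hypothetical names and the values are just borrowed from the question:

```python
import pandas as pd

# Hypothetical, already-parsed rectangular data: one inner list per row.
rows = [[0.084556, 0.138594],
        [0.101468, 0.0]]

# columns is one list of labels, one label per column of `rows`.
df_small = pd.DataFrame(rows, columns=["f_0000", "f_0001"])
print(df_small)
```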
I think the simplest approach is to stick with for loops.
First, select all the keys from the given `features`.
That first step can be summarized as:
keys = sorted(list(set([elt.split(':')[0] for l in features for elt in l])))
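One caveat, assuming the real data has ten or more features (the truncated rows in the question suggest it might): the keys are strings, so a plain `sorted` orders them lexicographically and "10" lands before "2". A small sketch of the difference, using a hypothetical input with a two-digit key:

```python
# Hypothetical input containing a two-digit key.
features = [["0:0.1", "10:0.2"], ["2:0.3"]]

raw_keys = {elt.split(":")[0] for row in features for elt in row}

lexicographic = sorted(raw_keys)       # string comparison
numeric = sorted(raw_keys, key=int)    # compare by integer value

print(lexicographic)  # ['0', '10', '2']
print(numeric)        # ['0', '2', '10']
```

Passing `key=int` keeps the keys as strings, so the rest of the approach should work unchanged.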
Then create an empty `dict` from those keys, initializing every key with an empty list:
data = {k:[] for k in keys}
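Iterate over all the features:
- Keep track of the feature keys visited for the current row in a `seen` variable
- Append the value of every key that is present
- Fill in the data for the keys that are missing from the current `features` row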
Finally, create the dataframe from the dictionary using the default constructor [pd.DataFrame()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).
Format the column names correctly using `.columns` and string formatting (`format`).
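As a quick illustration of the formatting step on its own: `{:04d}` pads an integer with leading zeros to a width of four. The keys below are hypothetical; they are strings, as produced by `split(":")`, so they need `int()` before the `d` format can be applied.

```python
# Hypothetical string keys, as they come out of split(":").
keys = ["0", "1", "12"]

# The "d" presentation type only accepts integers, hence the int() conversion.
names = ["f_{:04d}".format(int(k)) for k in keys]
print(names)
# ['f_0000', 'f_0001', 'f_0012']
```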
Enough talk, here is the full code + illustration:
import pandas as pd

features = [["0:0.084556", "1:0.138594", "2:0.094304"],
            ["0:0.101468", "4:0.138594", "5:0.377215"],
            ["0:0.135290", "2:0.277187", "3:0.141456"]]
# Step 1
keys = sorted(list(set([elt.split(':')[0] for l in features for elt in l])))
print(keys)
# ['0', '1', '2', '3', '4', '5']
# Step 2
data = {k:[] for k in keys}
print(data)
# {'0': [], '1': [], '2': [], '3': [], '4': [], '5': []}
# Step 3
for sub in features:
    # Step 3.1: keys seen in the current row
    seen = []
    # Step 3.2: store every value that is present
    for l in sub:
        k2, v = l.split(":")  # Get key and value
        data[k2].append(float(v))  # Append current value to data
        seen.append(k2)  # Set the key as seen
    # Step 3.3: pad the keys missing from this row
    for k in keys:  # For all data keys
        if k not in seen:  # If not seen
            data[k].append(0)  # Add 0
print(data)
# {'0': [0.084556, 0.101468, 0.13529],
# '1': [0.138594, 0, 0],
# '2': [0.094304, 0,0.277187],
# '3': [0, 0, 0.141456],
# '4': [0, 0.138594, 0],
# '5': [0, 0.377215, 0]
# }
# Step 4
df = pd.DataFrame(data)
print(df)
# 0 1 2 3 4 5
# 0 0.084556 0.138594 0.094304 0.000000 0.000000 0.000000
# 1 0.101468 0.000000 0.000000 0.000000 0.138594 0.377215
# 2 0.135290 0.000000 0.277187 0.141456 0.000000 0.000000
# Step 5
df.columns = ["f_{:04d}".format(int(val)) for val in df.columns]
print(df)
# f_0000 f_0001 f_0002 f_0003 f_0004 f_0005
# 0 0.084556 0.138594 0.094304 0.000000 0.000000 0.000000
# 1 0.101468 0.000000 0.000000 0.000000 0.138594 0.377215
# 2 0.135290 0.000000 0.277187 0.141456 0.000000 0.000000
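If the explicit loops feel heavy, a shorter variant might be to build one dict per row and let pandas fill the gaps. This is only a sketch of an alternative, not the approach above: `records` and `df_alt` are names made up here, and it relies on `pd.DataFrame` producing NaN for keys missing from a row, which `fillna` then replaces.

```python
import pandas as pd

features = [["0:0.084556", "1:0.138594", "2:0.094304\n"],
            ["0:0.101468", "4:0.138594", "5:0.377215\n"],
            ["0:0.135290", "2:0.277187", "3:0.141456\n"]]

# One dict per row, mapping integer column index to value.
# float() ignores the trailing "\n" present in the question's data.
records = [
    {int(key): float(value) for key, value in (item.split(":") for item in row)}
    for row in features
]

# Keys missing from a row come out as NaN; replace them with 0.0.
df_alt = pd.DataFrame(records).fillna(0.0)

# Put the columns in numeric order, then apply the f_0000 naming.
df_alt = df_alt.reindex(columns=sorted(df_alt.columns))
df_alt.columns = ["f_{:04d}".format(int(col)) for col in df_alt.columns]
print(df_alt)
```

Because the keys are converted to `int` before sorting, the column order stays numeric even with more than ten features.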