在 pandas 数据框中将一行分解为多行
Explode a row to multiple rows in pandas dataframe
我有一个包含以下内容的数据框 header:
id, type1, ..., type10, location1, ..., location10
我想按如下方式转换它:
id, type, location
我设法使用嵌入式 for 循环做到了这一点,但速度很慢:
new_format_columns = ['ID', 'type', 'location']
new_format_dataframe = pd.DataFrame(columns=new_format_columns)
print(data.head())
new_index = 0
for index, row in data.iterrows():
ID = row["ID"]
for i in range(1,11):
if row["type"+str(i)] == np.nan:
continue
else:
new_row = pd.Series([ID, row["type"+str(i)], row["location"+str(i)]])
new_format_dataframe.loc[new_index] = new_row.values
new_index += 1
关于使用本机 pandas 功能进行改进的任何建议?
您可以使用 lreshape
:
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
样本:
import pandas as pd
df = pd.DataFrame({
'type1': {0: 1, 1: 4},
'id': {0: 'a', 1: 'a'},
'type10': {0: 1, 1: 8},
'location1': {0: 2, 1: 9},
'location10': {0: 5, 1: 7}})
print (df)
id location1 location10 type1 type10
0 a 2 5 1 1
1 a 9 7 4 8
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
id Location Type
0 a 2 1
1 a 9 4
2 a 5 1
3 a 7 8
另一个双melt
的解决方案:
print (pd.concat([pd.melt(df, id_vars='id', value_vars=types, value_name='type'),
pd.melt(df, value_vars=location, value_name='Location')], axis=1)
.drop('variable', axis=1))
id type Location
0 a 1 2
1 a 4 9
2 a 1 5
3 a 8 7
编辑:
lreshape
is now undocumented, but is possible in future will by removed (with pd.wide_to_long too)。
可能的解决方案是将所有 3 个功能合并为一个 - 也许 melt
,但现在尚未实施。也许在 pandas 的某些新版本中。然后我的回答会更新。
我有一个包含以下内容的数据框 header:
id, type1, ..., type10, location1, ..., location10
我想按如下方式转换它:
id, type, location
我设法使用嵌入式 for 循环做到了这一点,但速度很慢:
new_format_columns = ['ID', 'type', 'location']
new_format_dataframe = pd.DataFrame(columns=new_format_columns)
print(data.head())
new_index = 0
for index, row in data.iterrows():
ID = row["ID"]
for i in range(1,11):
if row["type"+str(i)] == np.nan:
continue
else:
new_row = pd.Series([ID, row["type"+str(i)], row["location"+str(i)]])
new_format_dataframe.loc[new_index] = new_row.values
new_index += 1
关于使用本机 pandas 功能进行改进的任何建议?
您可以使用 lreshape
:
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
样本:
import pandas as pd
df = pd.DataFrame({
'type1': {0: 1, 1: 4},
'id': {0: 'a', 1: 'a'},
'type10': {0: 1, 1: 8},
'location1': {0: 2, 1: 9},
'location10': {0: 5, 1: 7}})
print (df)
id location1 location10 type1 type10
0 a 2 5 1 1
1 a 9 7 4 8
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
id Location Type
0 a 2 1
1 a 9 4
2 a 5 1
3 a 7 8
另一个双melt
的解决方案:
print (pd.concat([pd.melt(df, id_vars='id', value_vars=types, value_name='type'),
pd.melt(df, value_vars=location, value_name='Location')], axis=1)
.drop('variable', axis=1))
id type Location
0 a 1 2
1 a 4 9
2 a 1 5
3 a 8 7
编辑:
lreshape
is now undocumented, but is possible in future will by removed (with pd.wide_to_long too)。
可能的解决方案是将所有 3 个功能合并为一个 - 也许 melt
,但现在尚未实施。也许在 pandas 的某些新版本中。然后我的回答会更新。