Is there a way to loop through a DataFrame and, based on the values in each row, mark a value in multiple new columns in Pandas?
The DataFrame looks similar to this:
import pandas as pd
start = [0,2,4,5,1]
end = [3,5,5,5,2]
df = pd.DataFrame({'start': start,'end': end})
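The result I want looks like this:
Basically, I want to mark a value across multiple columns, from start to end. So if a row starts at 0 and ends at 3, I want to mark the new columns 0 through 3 with the value 1 and the rest with 0.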
start = [0,2,4,5,1]
end = [3,5,5,5,2]
diff = [3,3,1,0,1]
col_0 = [1,0,0,0,0]
col_1 = [1,0,0,0,1]
col_2 = [1,1,0,0,1]
col_3 = [1,1,0,0,0]
col_4 = [0,1,1,0,0]
col_5 = [0,1,1,1,0]
df = pd.DataFrame({'start': start,'end': end, 'col_0':col_0, 'col_1': col_1, 'col_2': col_2, 'col_3':col_3, 'col_4': col_4, 'col_5': col_5})
start end col_0 col_1 col_2 col_3 col_4 col_5
0 3 1 1 1 1 0 0
2 5 0 0 1 1 1 1
4 5 0 0 0 0 1 1
5 5 0 0 0 0 0 1
1 2 0 1 1 0 0 0
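Convert each row's range from start to end into a list of indices, then explode it. Finally, use the resulting row and column indices to set the values to 1: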
import numpy as np
range_to_ind = lambda x: range(x['start'], x['end']+1)
(i, j) = df.apply(range_to_ind, axis=1).explode().astype(int).reset_index().values.T
a = np.zeros((df.shape[0], max(df['end'])+1), dtype=int)
a[i, j] = 1
df = df.join(pd.DataFrame(a).add_prefix('col_'))
Output:
>>> df
start end col_0 col_1 col_2 col_3 col_4 col_5
0 0 3 1 1 1 1 0 0
1 2 5 0 0 1 1 1 1
2 4 5 0 0 0 0 1 1
3 5 5 0 0 0 0 0 1
4 1 2 0 1 1 0 0 0
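To make the explode step easier to follow, here is a minimal sketch that spells out the intermediate values. It builds the per-row lists with a list comprehension instead of apply, and the names ranges, pairs, i and j are only illustrative. It also assumes the DataFrame has the default 0..n-1 index, since the row labels are used as array positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [0, 2, 4, 5, 1], 'end': [3, 5, 5, 5, 2]})

# One list of column positions per row, covering start..end inclusive.
ranges = pd.Series(
    [list(range(s, e + 1)) for s, e in zip(df['start'], df['end'])],
    index=df.index,
)

# explode() yields one entry per (row label, column position) pair;
# the row label is repeated in the index.
pairs = ranges.explode().astype(int)
i = pairs.index.to_numpy()  # row positions (assumes a default 0..n-1 index)
j = pairs.to_numpy()        # column positions

a = np.zeros((len(df), df['end'].max() + 1), dtype=int)
a[i, j] = 1  # integer-array indexing sets every (i, j) cell at once

print(i.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4, 4]
print(j.tolist())  # [0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 5, 1, 2]
```

If the index is not the default one, resetting it first (or using positions from np.arange(len(df))) would be needed before indexing into the array.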
If performance is important, use dict.fromkeys in a list comprehension over the rows of the DataFrame and pass the result to the DataFrame constructor:
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
df = df.join(pd.DataFrame(L, index=df.index).add_prefix('col_').fillna(0).astype(int))
print (df)
start end col_0 col_1 col_2 col_3 col_4 col_5
0 0 3 1 1 1 1 0 0
1 2 5 0 0 1 1 1 1
2 4 5 0 0 0 0 1 1
3 5 5 0 0 0 0 0 1
4 1 2 0 1 1 0 0 0
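As a rough illustration of why the fillna(0).astype(int) step is there, this sketch looks at the intermediate objects. The variable name raw is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'start': [0, 2, 4, 5, 1], 'end': [3, 5, 5, 5, 2]})

# One dict per row: the keys are the positions to mark, every value is 1.
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
print(L[0])  # {0: 1, 1: 1, 2: 1, 3: 1}

# The constructor takes the union of all keys as columns. Keys that are
# absent from a row's dict become NaN, which makes the columns float.
raw = pd.DataFrame(L, index=df.index)
print(int(raw.isna().sum().sum()))  # 17

# Replacing NaN with 0 and casting back gives the 0/1 integer marks.
marks = raw.fillna(0).astype(int).add_prefix('col_')
```

Of the 30 cells, 13 are set by the dicts, so the remaining 17 start out as NaN.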
If some values in the overall range may not be covered by any row, as in the modified sample data here, add DataFrame.reindex:
#missing column 6
start = [0,2,4,7,1]
end = [3,5,5,8,2]
df = pd.DataFrame({'start': start,'end': end})
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
df1 = (pd.DataFrame(L, index=df.index)
         .reindex(columns=range(df['start'].min(), df['end'].max() + 1), fill_value=0)
         .add_prefix('col_')
         .fillna(0)
         .astype(int))
df = df.join(df1)
print (df)
start end col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8
0 0 3 1 1 1 1 0 0 0 0 0
1 2 5 0 0 1 1 1 1 0 0 0
2 4 5 0 0 0 0 1 1 0 0 0
3 7 8 0 0 0 0 0 0 0 1 1
4 1 2 0 1 1 0 0 0 0 0 0
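A small sketch of what goes wrong without reindex on this data: no row covers position 6, so the constructor never creates that column. The names without and with_reindex are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'start': [0, 2, 4, 7, 1], 'end': [3, 5, 5, 8, 2]})
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]

# Columns come only from keys that actually occur in some dict.
without = pd.DataFrame(L, index=df.index)
print(6 in without.columns)  # False

# Reindexing over the full min..max range adds the missing column.
full = range(df['start'].min(), df['end'].max() + 1)
with_reindex = without.reindex(columns=full).fillna(0).astype(int)
print(6 in with_reindex.columns)      # True
print(int(with_reindex[6].sum()))     # 0
```

Reindexing also puts the columns in ascending order, which helps when rows are not sorted by start.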
EDIT: To count hours, use:
start = pd.to_datetime([0,2,4,5,1], format='%H')
end = pd.to_datetime([3,5,5,5,2], format='%H')
df = pd.DataFrame({'start': start,'end': end})
df.loc[[0,1], 'end'] += pd.Timedelta(1, 'day')
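# One dict per row, keyed by the hourly timestamps between start and end.
# Note: pandas 2.2+ deprecates the 'H' alias in favor of lowercase 'h'.
L = [dict.fromkeys(pd.date_range(s, e, freq='H'), 1) for s, e in zip(df['start'], df['end'])]
df1 = pd.DataFrame(L, index=df.index)
# Aggregate by hour of day across the columns.
# Note: groupby(..., axis=1) is deprecated since pandas 2.1; on newer versions,
# df1.T.groupby(df1.columns.hour).sum().T.astype(int) should give the same result.
df1 = df1.groupby(df1.columns.hour, axis=1).sum().astype(int)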
print (df1)
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 \
0 2 2 2 2 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
1 1 1 2 2 2 2 1 1 1 1 ... 1 1 1 1 1 1 1
2 0 0 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0
4 0 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
21 22 23
0 1 1 1
1 1 1 1
2 0 0 0
3 0 0 0
4 0 0 0
[5 rows x 24 columns]
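To see where the 2s in the first row come from, here is a minimal sketch for a single interval that runs past midnight. The two timestamps are hypothetical stand-ins for one row of the data, and the step is given as a Timedelta so the sketch does not depend on how the hourly frequency alias is spelled.

```python
import pandas as pd

# Hypothetical single interval: midnight to 03:00 on the following day.
start = pd.Timestamp('1900-01-01 00:00')
end = pd.Timestamp('1900-01-02 03:00')

# Hourly timestamps, both endpoints included.
stamps = pd.date_range(start, end, freq=pd.Timedelta(hours=1))

# Count how often each hour of the day occurs, filling unseen hours with 0.
counts = (pd.Series(stamps.hour)
            .value_counts()
            .reindex(range(24), fill_value=0))

print(len(stamps))                           # 28
print(counts.loc[[0, 3, 4, 23]].tolist())    # [2, 2, 1, 1]
```

Hours 0 through 3 are hit on both days, so they count twice, while every other hour is hit once.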