Is there a way to loop through a DataFrame and, based on each row's column values, mark a value in multiple new columns in Pandas?

Question

The DataFrame looks like this:

import pandas as pd

start = [0,2,4,5,1]
end = [3,5,5,5,2]
df = pd.DataFrame({'start': start,'end': end})

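The result I want looks like this: basically, mark a value across multiple columns from start to end. So if a row starts at 0 and ends at 3, I want to mark the new columns col_0 through col_3 with the value 1 and the rest with 0.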

start = [0,2,4,5,1]
end = [3,5,5,5,2]
diff = [3,3,1,0,1]
col_0 = [1,0,0,0,0]
col_1=[1,0,0,0,1]
col_2 = [1,1,0,0,1]
col_3=[1,1,0,0,0]
col_4=[0,1,1,0,0]
col_5=[0,1,1,1,0]

df = pd.DataFrame({'start': start,'end': end, 'col_0':col_0, 'col_1': col_1, 'col_2': col_2, 'col_3':col_3, 'col_4': col_4, 'col_5': col_5})

start   end  col_0  col_1   col_2   col_3   col_4   col_5
0        3    1      1        1      1        0       0
2        5    0      0        1      1        1       1
4        5    0      0        0      0        1       1
5        5    0      0        0      0        0       1
1        2    0      1        1      0        0       0

Answer 1

Convert each range from start to end into a list of indices, then explode it. Finally, use those indices to set the values to 1:

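import numpy as np
import pandas as pd

# Build the inclusive range of column positions for each row
range_to_ind = lambda x: range(x['start'], x['end']+1)
# i = row labels (equal to row positions for the default RangeIndex),
# j = column positions of every cell to flag
(i, j) = df.apply(range_to_ind, axis=1).explode().astype(int).reset_index().values.T

# Start from an all-zero matrix wide enough for the largest end value
a = np.zeros((df.shape[0], max(df['end'])+1), dtype=int)
a[i, j] = 1

df = df.join(pd.DataFrame(a).add_prefix('col_'))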

Output:

>>> df
   start  end  col_0  col_1  col_2  col_3  col_4  col_5
0      0    3      1      1      1      1      0      0
1      2    5      0      0      1      1      1      1
2      4    5      0      0      0      0      1      1
3      5    5      0      0      0      0      0      1
4      1    2      0      1      1      0      0      0
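
As an alternative sketch (not part of the answer above), the same 0/1 matrix can probably be built without apply/explode by comparing each row's start and end against every column position with NumPy broadcasting. The names cols, mask, flags and result are chosen here for illustration only, and the sample data is copied from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [0, 2, 4, 5, 1], 'end': [3, 5, 5, 5, 2]})

# Candidate column positions 0..max(end), shape (n_cols,)
cols = np.arange(df['end'].max() + 1)

# Reshape start/end to (n_rows, 1) so they broadcast against cols
start = df['start'].to_numpy()[:, None]
end = df['end'].to_numpy()[:, None]

# True where the column position lies inside the inclusive [start, end] range
mask = (cols >= start) & (cols <= end)

flags = pd.DataFrame(mask.astype(int), index=df.index).add_prefix('col_')
result = df.join(flags)
print(result)
```

This assumes non-negative integer start/end values, as in the question; it allocates the full rows x columns matrix, just like the np.zeros approach.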

Answer 2

If performance matters, use dict.fromkeys in a list comprehension over the rows of the DataFrame and pass the result to the DataFrame constructor:

L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]

df = df.join(pd.DataFrame(L, index=df.index).add_prefix('col_').fillna(0).astype(int))
print (df)
   start  end  col_0  col_1  col_2  col_3  col_4  col_5
0      0    3      1      1      1      1      0      0
1      2    5      0      0      1      1      1      1
2      4    5      0      0      0      0      1      1
3      5    5      0      0      0      0      0      1
4      1    2      0      1      1      0      0      0
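
To see why fillna(0) and astype(int) are needed here, it may help to look at the intermediate objects. A small sketch on the question's sample data; the name raw is only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'start': [0, 2, 4, 5, 1], 'end': [3, 5, 5, 5, 2]})

# One dict per row: every position in the inclusive range maps to 1
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
print(L[0])  # {0: 1, 1: 1, 2: 1, 3: 1}

# Keys missing from a row's dict become NaN in the constructed frame
raw = pd.DataFrame(L, index=df.index)
print(raw)

# Replacing NaN with 0 and casting to int yields the 0/1 flags
flags = raw.fillna(0).astype(int)
print(flags)
```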

If some values of the range may be missing from every row, add DataFrame.reindex, as in this changed sample data:

#missing column 6
start = [0,2,4,7,1]
end = [3,5,5,8,2]
df = pd.DataFrame({'start': start,'end': end})

L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]

df1 = (pd.DataFrame(L, index=df.index)
         .reindex(columns=range(df['start'].min(), df['end'].max() + 1), fill_value=0)
         .add_prefix('col_')
         .fillna(0)
         .astype(int))

df = df.join(df1)
print (df)
   start  end  col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7  col_8
0      0    3      1      1      1      1      0      0      0      0      0
1      2    5      0      0      1      1      1      1      0      0      0
2      4    5      0      0      0      0      1      1      0      0      0
3      7    8      0      0      0      0      0      0      0      1      1
4      1    2      0      1      1      0      0      0      0      0      0
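
A quick sanity check for this kind of result, sketched under the assumption that the ranges are inclusive as in the question: each row should contain exactly end - start + 1 ones, and a position that no row covers (6 here) should be all zeros. The names flags and expected_counts are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'start': [0, 2, 4, 7, 1], 'end': [3, 5, 5, 8, 2]})
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]

flags = (pd.DataFrame(L, index=df.index)
           .reindex(columns=range(df['start'].min(), df['end'].max() + 1))
           .fillna(0)
           .astype(int)
           .add_prefix('col_'))

# Number of flagged columns per row should equal the inclusive range length
expected_counts = df['end'] - df['start'] + 1
counts_ok = (flags.sum(axis=1) == expected_counts).all()

# The position that no row covers should exist and contain only zeros
gap_ok = (flags['col_6'] == 0).all()

print(counts_ok, gap_ok)
```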

EDIT: To count hours, use:

start = pd.to_datetime([0,2,4,5,1], format='%H')
end = pd.to_datetime([3,5,5,5,2], format='%H')
df = pd.DataFrame({'start': start,'end': end})
df.loc[[0,1], 'end'] += pd.Timedelta(1, 'day')

#list for hours datetimes
L = [dict.fromkeys(pd.date_range(s, e, freq='H'), 1) for s, e in zip(df['start'], df['end'])]

df1 = pd.DataFrame(L, index=df.index)

#aggregate sum by hours in columns
df1 = df1.groupby(df1.columns.hour, axis=1).sum().astype(int)

print (df1)
   0   1   2   3   4   5   6   7   8   9   ...  14  15  16  17  18  19  20  \
0   2   2   2   2   1   1   1   1   1   1  ...   1   1   1   1   1   1   1   
1   1   1   2   2   2   2   1   1   1   1  ...   1   1   1   1   1   1   1   
2   0   0   0   0   1   1   0   0   0   0  ...   0   0   0   0   0   0   0   
3   0   0   0   0   0   1   0   0   0   0  ...   0   0   0   0   0   0   0   
4   0   1   1   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   

   21  22  23  
0   1   1   1  
1   1   1   1  
2   0   0   0  
3   0   0   0  
4   0   0   0  

[5 rows x 24 columns]
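
Recent pandas releases deprecate groupby(..., axis=1). A possible variant, sketched here on a two-row subset with explicit timestamps, transposes the frame, groups the rows by hour, and transposes back. Passing pd.Timedelta(hours=1) as freq is a choice made here to avoid depending on the spelling of the hourly alias; the name hours is illustrative:

```python
import pandas as pd

# Two sample rows: one spanning midnight into the next day, one within a day
start = pd.to_datetime(['1900-01-01 00:00', '1900-01-01 02:00'])
end = pd.to_datetime(['1900-01-02 03:00', '1900-01-01 05:00'])
df = pd.DataFrame({'start': start, 'end': end})

# One dict per row, keyed by every hourly timestamp in the inclusive range
L = [dict.fromkeys(pd.date_range(s, e, freq=pd.Timedelta(hours=1)), 1)
     for s, e in zip(df['start'], df['end'])]
df1 = pd.DataFrame(L, index=df.index)

# Transpose so timestamps become rows, sum per hour of day, transpose back
hours = df1.T.groupby(df1.columns.hour).sum().T.astype(int)
print(hours)
```

Hours that a range passes through twice (0 to 3 for the first row) end up with a count of 2, matching the behaviour of the axis=1 version.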

