NumPy: creating a variable-length sequence from a pandas DataFrame
Suppose I have the following data frame:
df_raw = pd.DataFrame({"person_id": [101, 101, 102, 102, 102, 103], "date": [0, 5, 0, 7, 11, 0], "val1": [99, 11, 22, 33, 44, 22], "val2": [77, 88, 22, 66, 55, 33]})
What I want to achieve is a 3-dimensional numpy array that looks like this:
np_pros = np.array([[[0, 99, 77], [5, 11, 88]], [[0, 22, 22], [7, 33, 66], [11, 44, 55]], [[0, 22, 33]]])
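In other words, the 3D array should have the shape [unique_ids, None, feature_size]. In my case the number of unique_ids is 3, the feature size is 3 (all columns except person_id), and the y dimension (the second axis) has variable length: it is the number of measurements for each person_id.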
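A side note on the target shape: because the second axis varies per person, the result cannot be a regular 3D array, and recent NumPy versions refuse to build one from ragged nested lists unless dtype=object is given. A minimal sketch of one way to hold such data, assuming a 1-D object array of 2-D blocks is acceptable (the names blocks and ragged are illustrative only):

```python
import numpy as np

# The three per-person blocks from the question; their row counts differ (2, 3, 1).
blocks = [
    np.array([[0, 99, 77], [5, 11, 88]]),
    np.array([[0, 22, 22], [7, 33, 66], [11, 44, 55]]),
    np.array([[0, 22, 33]]),
]

# Pre-allocate a 1-D object array and fill it element by element, so NumPy
# never tries to merge the blocks into one rectangular array.
ragged = np.empty(len(blocks), dtype=object)
for i, block in enumerate(blocks):
    ragged[i] = block

print(ragged.shape)               # (3,)
print([b.shape for b in ragged])  # [(2, 3), (3, 3), (1, 3)]
```

A plain Python list of arrays carries the same information; the object array only helps if NumPy-style indexing over persons is needed.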
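I am well aware that I could create an np.zeros((unique_ids, max_num_features, feature_size)) array, fill it, and then delete the elements I don't need, but I want something faster. The reason is that my actual data frame is large (roughly [50000, 455]), which would result in a numpy array of roughly [12500, 200, 455].
Looking forward to your answers!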
Here's one approach (df1 is the data frame from the question, called df_raw there):
ix = np.flatnonzero(df1.person_id != df1.person_id.shift(1))
np.split(df1.drop('person_id', axis=1).values, ix[1:])
[array([[ 0, 99, 77],
[ 5, 11, 88]], dtype=int64),
array([[ 0, 22, 22],
[ 7, 33, 66],
[11, 44, 55]], dtype=int64),
array([[ 0, 22, 33]], dtype=int64)]
Details
Use np.flatnonzero after comparing the person_id column of df1 with a shifted version of itself (Series.shift) to get the indices where person_id changes:
ix = np.flatnonzero(df1.person_id != df1.person_id.shift(1))
#array([0, 2, 5])
Use np.split to split the columns of interest of the data frame at the obtained indices:
np.split(df1.drop('person_id', axis=1).values, ix[1:])
[array([[ 0, 99, 77],
[ 5, 11, 88]], dtype=int64),
array([[ 0, 22, 22],
[ 7, 33, 66],
[11, 44, 55]], dtype=int64),
array([[ 0, 22, 33]], dtype=int64)]
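The shift-and-split approach relies on all rows of a given person_id being contiguous. A self-contained sketch that makes this explicit, assuming a stable sort by person_id is acceptable (split_by_id is a hypothetical helper name, not part of pandas or NumPy):

```python
import numpy as np
import pandas as pd

df_raw = pd.DataFrame({
    "person_id": [101, 101, 102, 102, 102, 103],
    "date": [0, 5, 0, 7, 11, 0],
    "val1": [99, 11, 22, 33, 44, 22],
    "val2": [77, 88, 22, 66, 55, 33],
})

def split_by_id(df, id_col="person_id"):
    """Hypothetical helper: return one 2-D feature array per id value."""
    # Mergesort is stable: rows of one id become contiguous and keep their order.
    df = df.sort_values(id_col, kind="mergesort")
    ids = df[id_col].to_numpy()
    # Positions where the id differs from the previous row, i.e. block starts.
    starts = np.flatnonzero(ids[1:] != ids[:-1]) + 1
    features = df.drop(columns=id_col).to_numpy()
    return np.split(features, starts)

blocks = split_by_id(df_raw)
print([b.shape for b in blocks])  # [(2, 3), (3, 3), (1, 3)]
```

If the frame is already sorted by person_id, the sort can be dropped; np.split then only creates views into the feature matrix.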
You can use groupby:
import pandas as pd
df_raw = pd.DataFrame({"person_id": [101, 101, 102, 102, 102, 103], "date": [0, 5, 0, 7, 11, 0], "val1": [99, 11, 22, 33, 44, 22], "val2": [77, 88, 22, 66, 55, 33]})
# Select the feature columns inside each group, so person_id is never included
result = [group[['date', 'val1', 'val2']].values for _, group in df_raw.groupby('person_id')]
print(result)
Output
[array([[ 0, 99, 77],
[ 5, 11, 88]]), array([[ 0, 22, 22],
[ 7, 33, 66],
[11, 44, 55]]), array([[ 0, 22, 33]])]
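A variant of the groupby idea, as a sketch: keeping the ids next to their arrays avoids having to remember which list position belongs to which person. The names feature_cols and by_person are illustrative assumptions, and sort=False is used only to keep groups in order of first appearance.

```python
import pandas as pd

df_raw = pd.DataFrame({
    "person_id": [101, 101, 102, 102, 102, 103],
    "date": [0, 5, 0, 7, 11, 0],
    "val1": [99, 11, 22, 33, 44, 22],
    "val2": [77, 88, 22, 66, 55, 33],
})
feature_cols = ["date", "val1", "val2"]

# Map each person_id to its (n_measurements, n_features) array.
by_person = {
    person_id: group[feature_cols].to_numpy()
    for person_id, group in df_raw.groupby("person_id", sort=False)
}

print(by_person[102])
```

Selecting feature_cols inside the loop keeps person_id out of the arrays regardless of how a given pandas version handles column selection on a groupby object during iteration.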
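Another solution, using xarray
Let's create the extra dimension implied by the duplicated person_id values (df here is the data frame from the question):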
>>> df['newdim'] = df.person_id.duplicated()
>>> df.newdim = df.groupby('person_id').newdim.cumsum()
>>> df = df.set_index(["newdim", "person_id"])
>>> df
                  date  val1  val2
newdim person_id
0.0    101           0    99    77
1.0    101           5    11    88
0.0    102           0    22    22
1.0    102           7    33    66
2.0    102          11    44    55
0.0    103           0    22    33
For readability, we may want to turn df into an xarray.Dataset object:
>>> xa = df.to_xarray()
>>> xa
<xarray.Dataset>
Dimensions: (newdim: 3, person_id: 3)
Coordinates:
* newdim (newdim) float64 0.0 1.0 2.0
* person_id (person_id) int64 101 102 103
Data variables:
date (newdim, person_id) float64 0.0 0.0 0.0 5.0 7.0 nan nan 11.0 nan
val1 (newdim, person_id) float64 99.0 22.0 22.0 11.0 33.0 nan nan ...
val2 (newdim, person_id) float64 77.0 22.0 33.0 88.0 66.0 nan nan ...
Then convert it into a numpy array with regular dimensions:
>>> ar = xa.to_array().T.values
>>> ar
array([[[ 0., 99., 77.],
[ 5., 11., 88.],
[nan, nan, nan]],
[[ 0., 22., 22.],
[ 7., 33., 66.],
[11., 44., 55.]],
[[ 0., 22., 33.],
[nan, nan, nan],
[nan, nan, nan]]])
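Note that the nan values are introduced by coercion: shorter sequences are padded to the length of the longest one, and the data becomes float.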
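The same padded layout can also be built with pandas and NumPy alone. The following is a minimal sketch, assuming NaN padding plus a boolean mask is acceptable; the names padded, mask and lengths are illustrative. Keep in mind that at the sizes mentioned in the question, a float64 array of shape (12500, 200, 455) takes about 9.1 GB.

```python
import numpy as np
import pandas as pd

df_raw = pd.DataFrame({
    "person_id": [101, 101, 102, 102, 102, 103],
    "date": [0, 5, 0, 7, 11, 0],
    "val1": [99, 11, 22, 33, 44, 22],
    "val2": [77, 88, 22, 66, 55, 33],
})
feature_cols = ["date", "val1", "val2"]

# First axis: which person a row belongs to (0, 1, 2, ... in order of appearance).
person_idx, person_ids = pd.factorize(df_raw["person_id"])
# Second axis: position of the row within that person's sequence.
step_idx = df_raw.groupby("person_id").cumcount().to_numpy()

n_persons, max_len = len(person_ids), step_idx.max() + 1

# Fill a NaN-padded array in one vectorized assignment.
padded = np.full((n_persons, max_len, len(feature_cols)), np.nan)
padded[person_idx, step_idx] = df_raw[feature_cols].to_numpy()

# True where a real measurement exists, False where the slot is padding.
mask = np.zeros((n_persons, max_len), dtype=bool)
mask[person_idx, step_idx] = True
lengths = mask.sum(axis=1)

print(padded.shape)  # (3, 3, 3)
print(lengths)       # [2 3 1]
```

The mask is built from the indices rather than from np.isnan, so genuine NaN measurements in the data are not mistaken for padding.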