Python: "Binning" subarrays

Question

I am trying to bin rows of data according to the first element of each row.

My data looks like this:

[[Temperature, value0, value1, ... value249]
 [Temperature, ...
]

That is, the first element of each row is a temperature value, and the rest of the row is the time trace of a signal.

I would like to build an array of this shape:

[Temperature-bin,[[values]
                  [values]
                     ... ]]
 Next Temp.-bin, [[values]
                  [values]
                     ... ]]
...
]

The rows of the original data array should be sorted into the sub-array of the corresponding temperature bin.

data= np.array([values]) # shape is [temp+250 timesteps,400K]
temp=data[0]

start=23000
end=380000

tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])

binsize=1
bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)

binneddata=np.empty([len(bincenters),2])

for i in np.arange(len(temp)):
    binneddata[i]=[bincenters[i],np.array([])]

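I expected to get a result array as described above, where each row holds the mean temperature of the bin (bincenters[i]) and an array of time traces. Instead, Python raises an error about "setting an array element with a sequence". In an earlier script I was able to create this kind of array of mixed data types, but I had to define it explicitly, which is not possible here because I am working with files of several 100K data rows. At the same time, I would like to use as many built-in functions and as few loops as possible, because my computer already takes a while to process files of that size.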
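For reference, the error can be reproduced in isolation. The sketch below assumes the cause is that np.empty creates a float array by default, in which each cell holds exactly one number; the toy bin centers and the traces array are hypothetical stand-ins for the real data. An array created with dtype=object accepts arbitrary Python objects per cell, at the cost of losing most vectorised operations. Note also that the loop in the snippet above runs over len(temp), while binneddata only has len(bincenters) rows.

```python
import numpy as np

bincenters = np.arange(139.0, 144.0, 1.0)   # hypothetical bin centers
traces = np.zeros((3, 250))                 # hypothetical block of three time traces

# Float array: each cell holds exactly one number, so this assignment fails.
binned_float = np.empty([len(bincenters), 2])
try:
    binned_float[0] = [bincenters[0], traces]
    error_message = ""
except (ValueError, TypeError) as exc:
    error_message = str(exc)

# Object array: each cell holds a reference to an arbitrary Python object.
binned_obj = np.empty([len(bincenters), 2], dtype=object)
for i, center in enumerate(bincenters):
    binned_obj[i, 0] = center
    binned_obj[i, 1] = np.empty((0, 250))   # no traces assigned yet
binned_obj[0, 1] = traces                   # store a block of traces in the first bin
```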

Thanks for your input,

乐帕克

Answer 1

First of all, thanks to kwinkunks for the hint to use a pandas DataFrame. I found a solution based on it.

The binning is now done like this:

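import numpy as np
import pandas as pd

# temperature range of the rows that are analysed
tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])

# bin centers at whole degrees; the edges lie half a bin width to either side
binsize=1
bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)
lowerbinedges=bincenters-binsize/2
higherbinedges=bincenters+binsize/2

# all lower edges plus the upper edge of the last bin
allbinedges=np.append(lowerbinedges,higherbinedges[-1])

temp_pd=pd.Series(temp[start:end])
# one entry per data row, each holding that row's time trace as an array
traces=pd.Series(list(data[start:end,0:250]))

# label each temperature with the center of the bin it falls into
tempbins=pd.cut(temp_pd,allbinedges,labels=bincenters)

df=pd.concat([temp_pd,tempbins,traces], keys=['Temp','Bincenter','Traces'], axis=1)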

This defines the bins (equally sized in this case). The variable "tempbins" has the same shape as temp (the "raw" temperatures) and assigns each data row to a bin.
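To see what pd.cut returns here, the same binning can be run on a handful of temperatures. This is a minimal sketch with made-up values; only the variable names are taken from the code above.

```python
import numpy as np
import pandas as pd

# Made-up temperatures standing in for temp[start:end].
temp = np.array([139.2, 139.9, 140.4, 141.0, 141.49, 140.6])

binsize = 1
bincenters = np.arange(np.round(temp.min()), np.round(temp.max()) + 1, binsize)
# lower edge of every bin plus the upper edge of the last one
allbinedges = np.append(bincenters - binsize / 2, bincenters[-1] + binsize / 2)

# Each temperature is labelled with the center of the bin it falls into.
tempbins = pd.cut(pd.Series(temp), allbinedges, labels=bincenters)
binned_centers = tempbins.astype(float).tolist()
```

By default pd.cut uses intervals that are open on the left and closed on the right, so a temperature lying exactly on the lowest edge would be left unassigned (NaN) unless include_lowest=True is passed.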

The actual analysis is very short. It starts with:

rf=pd.DataFrame({'Bincenter': bincenters})

The result frame ("rf") starts out with the bin centers (the x axis of later plots), and a column is simply added for each desired result.

With

df[df.Bincenter==xyz]

I can select only those rows of df that belong to the selected bin.

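In my case I am not interested in the individual time traces, only in their sum or mean. So I use lambda functions that run through the rows of rf and, for each row, look up the rows of df that have the same value in "Bincenter".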

rf['Binsize']=rf.apply(lambda row: len(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
rf['Trace_sum']=rf.apply(lambda row: sum(df.Traces[df.Bincenter==row.Bincenter]), axis=1)

This adds further columns to the result frame rf, holding the number of rows in each bin and the summed trace.
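The selection and the two lambda columns can be tried out on a toy frame. The sketch below uses three made-up traces of length 4 in place of the real 250-point traces; the groupby line at the end is an alternative for the row count that is not part of the original solution.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: three short traces spread over two bins.
traces_2d = np.array([[1.0, 2.0, 3.0, 4.0],
                      [10.0, 20.0, 30.0, 40.0],
                      [100.0, 200.0, 300.0, 400.0]])
df = pd.DataFrame({"Bincenter": [139.0, 140.0, 139.0]})
df["Traces"] = pd.Series(list(traces_2d))

rf = pd.DataFrame({"Bincenter": [139.0, 140.0]})

# Boolean selection of the rows that belong to one bin.
selected = df[df.Bincenter == 139.0]

# Same lambdas as above: row count and element-wise sum of the traces per bin.
rf["Binsize"] = rf.apply(lambda row: len(df.Traces[df.Bincenter == row.Bincenter]), axis=1)
rf["Trace_sum"] = rf.apply(lambda row: sum(df.Traces[df.Bincenter == row.Bincenter]), axis=1)

# Alternative for the count (not in the original solution): one pass with groupby.
sizes = df.groupby("Bincenter").size()
```

Each lambda call scans all of df once per bin, so the cost grows with the number of bins times the number of rows. For a few hundred bins this is usually acceptable, but a groupby pass may be worth trying if it becomes slow.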

I then fitted the traces in rf.Trace_sum, although I did not do that part in pandas.

Still, the DataFrame is very useful here. I used odr for the fits, like this:

for i in binnumber:
    fitdata=odr.Data(time[fitstart:],rf.Trace_sum.values[i][fitstart:])
    ... some more fit stuff here...
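The fit itself is elided above. As an illustration only, one fit could look like the sketch below. It assumes that odr refers to scipy.odr and that the model is a single exponential decay; the function decay, the synthetic trace, and the start values are hypothetical and do not reproduce the actual model or the layout of fitresult.

```python
import numpy as np
from scipy import odr

def decay(beta, t):
    # Hypothetical model: beta[0] is the amplitude, beta[1] the lifetime.
    return beta[0] * np.exp(-t / beta[1])

# Noise-free synthetic trace standing in for one entry of rf.Trace_sum.
time = np.linspace(0.0, 100.0, 250)
trace = decay([5.0, 30.0], time)

fitstart = 10
fitdata = odr.Data(time[fitstart:], trace[fitstart:])
model = odr.Model(decay)
output = odr.ODR(fitdata, model, beta0=[4.0, 25.0]).run()

# Fitted lifetime and its standard error.
lifetime = output.beta[1]
sd_lifetime = output.sd_beta[1]
```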

and saved the fit results in

lifetimefits=pd.DataFrame({'lifetime': fitresult[:,1], 'sd_lifetime':fitresult[:,4]})

Finally, I added them to the result frame with

rf=pd.concat([rf,lifetimefits],axis=1)
rf[['Bincenter','Binsize','lifetime','sd_lifetime']].to_csv('results.csv', header=True, index=False)

The resulting output looks like this:

Out[78]: 
    Bincenter  Binsize  ...   lifetime  sd_lifetime
0       139.0     4102  ...  38.492028     2.803211
1       140.0     4252  ...  33.659729     2.534872
2       141.0     3785  ...  31.220312     2.252104
3       142.0     3823  ...  29.391562     1.783890
4       143.0     3808  ...  40.422578     2.849545

I hope this explanation saves others from wasting time trying to do this with numpy. Thanks again to kwinkunks for the very helpful suggestion to use a pandas DataFrame.

Best, 乐帕克
