Compute mean and standard deviation for HDF5 data

Question

I am currently running 100 simulations, each of which computes 1M values (i.e., one value per episode/iteration).

Main routine

My main file looks like this:

# Defining the test simulation environment
def test_simulation():
    # Use a separate local name so the environment class is not shadowed
    env = environment(
        periods=1000000,
        parameter_x=...,
        parameter_y=...,
    )

    # Running the simulation
    env.simulation()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()

The simulation works as follows: within game() I generate a value_history that is continuously appended to:

def simulation(self):
    for episode in range(self.periods):
        value = doSomething()
        self.value_history.append(value)

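So for each episode/iteration, I compute a value, which is an array, e.g. [1.4 1.9] (player 1 has 1.4 and player 2 has 1.9 in the current episode/iteration).

Storage of the simulation data

To store the data, I use an approach proposed elsewhere, which works very well.

After running the simulations, I get the following Keys structure: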

Keys: <KeysViewHDF5 ['data_000', 'data_001', 'data_002', ..., 'data_100']>

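Calculating statistics for the files

Now, the goal is to calculate the average and standard deviation for each value across the 100 data files I ran. In the end, I would have a final_data set consisting of 1M averages and 1M standard deviations (one average and one standard deviation for each row, for each player, across the 100 simulations).

So the goal is to obtain something like the following structure, [average_player1, average_player2], [std_player1, std_player2]: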

episode == 1: [1.5, 1.5], [0.1, 0.2]
episode == 2: [1.4, 1.6], [0.2, 0.3]
...
episode == 1000000: [1.7, 1.6], [0.1, 0.3]

I currently use the following code to extract the stored data into an empty list:

import h5py

def ExtractSimData(name, simulation_runs, length):
    # Create empty list
    result = []

    # Call the simulation run file
    filename = f"runs/{length}/{name}_simulation_runs2.h5"

    with h5py.File(filename, "r") as hf:
        # List all datasets
        print("Keys: %s" % hf.keys())
        keys = list(hf.keys())

        for i in range(simulation_runs):
            # Read dataset i as a list of rows (one array per episode)
            data = list(hf[keys[i]])

            # Append every row to the single flat result list
            for element in data:
                result.append(element)

    return result

The data structure of result looks like this:

[array([1.9, 1.7]), array([1.4, 1.9]), array([1.6, 1.5]), ...]

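First attempt at calculating the mean

I tried the following code to get the mean score of the first element (each array consists of two elements because there are two players in the simulation):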

mean_result = [np.mean(k) for k in zip(*list(result))]

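However, this calculates the mean of each element of the arrays over the entire list, because I append every data set to the same empty list. My goal, however, is to calculate the average/standard deviation across the 100 data sets defined above (i.e., each value is the average/standard deviation over all 100 data sets).

Is there an efficient way to do this?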

Answer 1

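This computes the mean and standard deviation of the episode/player values across multiple datasets in 1 file. I think that is what you want to do. If not, I can modify it as needed. (Note: I created a small pseudo-data HDF5 file to replicate what you describe. For completeness, that code is at the end of this post.)

A summary of the steps in the procedure is given below (after opening the file):

1. Get basic size info from the file: the dataset count and the number of dataset rows.
2. Use the values above to size the arrays for the player 1 and player 2 values (variables p1_arr and p2_arr). shape[0] is the episode (row) count, and shape[1] is the simulation (dataset) count.
3. Loop over all datasets. I used hf.keys() (which iterates over the dataset names). You could also iterate over the names in the list ds_names created earlier. (I created it to simplify the size calculations in step 2.) The enumerate() counter i is used to index each simulation's episode values into the correct column of each player array.
4. To get the mean and standard deviation for each row, use the np.mean() and np.std() functions with the axis=1 parameter. That calculates the statistics across each row of simulation results.
5. Next, load the data into the result datasets. I created 2 datasets (same data, different dtypes) as described below:
   a. The 'final_data' dataset is a simple float array of shape=(# of episodes,4), where you need to know which value is in each column. (I suggest adding an attribute to document this.)
   b. The 'final_data_named' dataset uses a NumPy recarray so you can name the fields (columns). It has shape=(# of episodes,). You can access each column by name.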
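To make steps 2 to 4 concrete, here is a minimal in-memory sketch. The `runs` list and its numbers are made up for illustration and stand in for three datasets of four episodes each; nothing here comes from the question's real data. It fills the two player arrays column by column and reduces them with axis=1, then shows that stacking the runs and reducing over axis=0 gives the same numbers in a single (episodes, players) array.

```python
import numpy as np

# Hypothetical stand-in for three datasets, each of shape (episodes, players)
runs = [
    np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 3.0], [2.5, 3.5]]),
    np.array([[2.0, 4.0], [2.5, 4.5], [3.0, 5.0], [3.5, 5.5]]),
    np.array([[3.0, 6.0], [3.5, 6.5], [4.0, 7.0], [4.5, 7.5]]),
]

# Step 2: size the arrays as (episodes, simulations)
ep_cnt, sim_cnt = runs[0].shape[0], len(runs)
p1_arr = np.empty((ep_cnt, sim_cnt))
p2_arr = np.empty((ep_cnt, sim_cnt))

# Step 3: column i holds the episode values of simulation i
for i, run in enumerate(runs):
    p1_arr[:, i] = run[:, 0]
    p2_arr[:, i] = run[:, 1]

# Step 4: reduce across simulations, leaving one value per episode
p1_mean, p1_std = p1_arr.mean(axis=1), p1_arr.std(axis=1)
p2_mean, p2_std = p2_arr.mean(axis=1), p2_arr.std(axis=1)

# Equivalent view: stack to (simulations, episodes, players), reduce axis 0
stacked = np.stack(runs)
mean_all = stacked.mean(axis=0)   # shape (episodes, players)
std_all = stacked.std(axis=0)     # shape (episodes, players)

print(p1_mean)
print(mean_all[:, 0])
```

The stacked form avoids one array per player, which may be convenient if the number of players ever changes.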

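Statistical note: the calculations are sensitive to how the sum() operator behaves over the range of values. If your data is well-behaved, the NumPy functions are appropriate. I looked into this a few years ago. For all the details, see this discussion: when to use numpy vs statistics modules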
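A related detail that may be worth checking when comparing the two modules: np.std() defaults to ddof=0 (the population standard deviation), while statistics.stdev() computes the sample standard deviation (dividing by n-1). The sketch below uses made-up values to show the difference; which definition is appropriate for 100 simulation runs is a choice the answer above does not make.

```python
import statistics

import numpy as np

# Made-up values standing in for one episode's results across five runs
values = [1.9, 1.4, 1.6, 1.7, 1.5]

pop_std = np.std(values)             # ddof=0: divide by n
sample_std = np.std(values, ddof=1)  # ddof=1: divide by n - 1

# The statistics module exposes the two definitions as separate functions
print(pop_std, statistics.pstdev(values))
print(sample_std, statistics.stdev(values))
```

To get the sample version per episode, the same ddof=1 argument can be passed alongside axis=1.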

The code to read the data and compute the statistics is below.

import h5py
import numpy as np

def ExtractSimData(name, simulation_runs, length):

    # Call the simulation run file
    filename = f"runs/{length}/{name}simulation_runs2.h5"
    with h5py.File(filename, "a") as hf:
        # List all dataset names
        ds_names = list(hf.keys())
        print(f'Dataset names (keys): {ds_names}')

        # Create empty arrays for player1 and player2 episode values
        sim_cnt = len(ds_names)
        print(f'# of simulation runs (dataset count) = {sim_cnt}')
        ep_cnt = hf[ ds_names[0] ].shape[0]
        print(f'# of episodes (rows) in each dataset = {ep_cnt}')
        p1_arr = np.empty((ep_cnt,sim_cnt))
        p2_arr = np.empty((ep_cnt,sim_cnt))
        
        for i, ds in enumerate(hf.keys()): # each dataset is 1 simulation               
            p1_arr[:,i] = hf[ds][:,0]
            p2_arr[:,i] = hf[ds][:,1]
                
        ds1 = hf.create_dataset('final_data', shape=(ep_cnt,4), 
                          compression='gzip', chunks=True)   
        ds1[:,0] = np.mean(p1_arr, axis=1)
        ds1[:,1] = np.std(p1_arr, axis=1)
        ds1[:,2] = np.mean(p2_arr, axis=1)
        ds1[:,3] = np.std(p2_arr, axis=1)        

        dt = np.dtype([ ('average_player1',float), ('average_player2',float), 
                        ('std_player1',float), ('std_player2',float) ] )
        ds2 = hf.create_dataset('final_data_named', shape=(ep_cnt,), dtype=dt, 
                          compression='gzip', chunks=True)   
        ds2['average_player1'] = np.mean(p1_arr, axis=1)
        ds2['std_player1'] = np.std(p1_arr, axis=1)
        ds2['average_player2'] = np.mean(p2_arr, axis=1)
        ds2['std_player2'] = np.std(p2_arr, axis=1)        

### main ###
simulation_runs = 10
length='01'
name='test_'
ExtractSimData(name, simulation_runs, length)
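A note on memory: with 1M episodes and 100 runs, p1_arr and p2_arr together hold 200M values, which is roughly 1.6 GB if they are float64. If that is too much, one option is to process the episodes in blocks. The sketch below is an assumption-laden illustration, not part of the code above: `blockwise_stats` and `block_rows` are hypothetical names, and plain NumPy arrays stand in for the datasets. h5py datasets support the same `ds[start:stop]` row slicing, so a list such as `[hf[k] for k in ds_names]` should be usable in the same way.

```python
import numpy as np

def blockwise_stats(datasets, block_rows=100_000):
    """Per-episode mean and population std, computed one block of rows at a time.

    `datasets` is a sequence of array-likes of shape (episodes, players)
    that support row slicing (NumPy arrays here; h5py datasets also do).
    """
    ep_cnt, players = datasets[0].shape
    mean = np.empty((ep_cnt, players))
    std = np.empty((ep_cnt, players))

    for start in range(0, ep_cnt, block_rows):
        stop = min(start + block_rows, ep_cnt)
        # Only this block is in memory: (simulations, rows in block, players)
        block = np.stack([ds[start:stop] for ds in datasets])
        mean[start:stop] = block.mean(axis=0)
        std[start:stop] = block.std(axis=0)

    return mean, std

# Small random stand-in: 10 runs of 1000 episodes and 2 players
rng = np.random.default_rng(42)
runs = [rng.random((1000, 2)) for _ in range(10)]

mean, std = blockwise_stats(runs, block_rows=300)

# Reference result computed in one go, for comparison
full = np.stack(runs)
print(np.allclose(mean, full.mean(axis=0)), np.allclose(std, full.std(axis=0)))
```

Because each episode's statistics depend only on that episode's row in every run, splitting by rows does not change the results; it only bounds how much data is loaded at once.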

The code to create the pseudo-data HDF5 file is below.

import h5py
import numpy as np

# Create some pseudo-test data
def test_simulation(i):
    players = 2
    periods = 1000

    # Define the simulation with some random data
    val_hist = np.random.random(periods*players).reshape(periods,players)    
    
    if i == 0:
        mode='w'
    else:
        mode='a'
    # Save simulation data (unique datasets)
    with h5py.File('runs/01/test_simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist, 
                          compression='gzip', chunks=True)

# Run the simulation N times
simulations = 10
for i in range(simulations):
    print(f'--- Iteration {i} ---')
    test_simulation(i)
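For reading the results back, here is a minimal sketch of the two access styles described in step 5, including the suggested attribute for documenting the columns of the plain array. It writes its own tiny temporary file so it can run standalone; the file name, the `columns` attribute name, and the random numbers are assumptions for illustration, and the field order of the demo dtype simply follows the column order of 'final_data'.

```python
import os
import tempfile

import h5py
import numpy as np

# Column order used for 'final_data' in the answer's code
columns = ["average_player1", "std_player1", "average_player2", "std_player2"]

ep_cnt = 5
rng = np.random.default_rng(0)
stats = rng.random((ep_cnt, len(columns)))  # made-up statistics

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_stats.h5")  # hypothetical file name

    with h5py.File(path, "w") as hf:
        # Plain float array, with an attribute recording what each column means
        ds1 = hf.create_dataset("final_data", data=stats)
        ds1.attrs["columns"] = ",".join(columns)

        # Compound dtype: each column becomes a named field
        dt = np.dtype([(name, float) for name in columns])
        named = np.empty(ep_cnt, dtype=dt)
        for col, name in enumerate(columns):
            named[name] = stats[:, col]
        hf.create_dataset("final_data_named", data=named)

    with h5py.File(path, "r") as hf:
        # Style a: look the column index up from the attribute
        stored_columns = hf["final_data"].attrs["columns"].split(",")
        avg_p1_plain = hf["final_data"][:, stored_columns.index("average_player1")]

        # Style b: read the field directly by name
        avg_p1_named = hf["final_data_named"]["average_player1"]

print(np.array_equal(avg_p1_plain, avg_p1_named))
```

Storing the column names as one comma-separated string keeps the attribute a simple scalar; other encodings are possible.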
