How to get the frequency of each timestamp while reading a file and reduce memory usage

Question

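I'm stuck on a memory-limit-exceeded error because I took a naive approach: allocating an array of size 2^25. By using the array index to represent the timestamp, I can find a timestamp and increment its occurrence count in O(1) time, but this wastes a lot of memory. So I'd like to know whether there is a way to reduce the memory usage. Timestamps are grouped by hour with time_stamp - (time_stamp % 3600), and start_time is subtracted to avoid index overflow.

How can I reduce memory usage while still reading the timestamps and counting occurrences quickly? *The input timestamps can arrive in any order.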

#define max_entry (2 << 25) // note: 2 << 25 evaluates to 2^26, not 2^25
#define start_time 1645491600
struct pair
{
    long timestamp;
    int occurrence;
};

struct pair *counter_array = (struct pair *)calloc(max_entry, sizeof(struct pair)); // declare the pair array, zero-initialized

// read the file line by line
while (fgets(buffer, buffer_size, common_file) != NULL)
{
     char *temp;
     long time_stamp = strtol(buffer, &temp, 10);
     long hour = time_stamp - (time_stamp % 3600); // round down to the start of the hour
     counter_array[hour - start_time].timestamp = hour;
     counter_array[hour - start_time].occurrence += 1;
}

I know using max_entry is not good, but I really don't know how to change it.

Answer 1

Since you round time_stamp down to the hour, index by hours instead of seconds. Just use

time_stamp_index = (time_stamp - start_time) / 3600;

Modify your struct to store the frequency and the hour:

struct hour_freq {
    int hours;
    int freq;
};
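
Putting the index formula and the struct together, the counting loop could look roughly like the sketch below. The helper names hour_index and count_timestamps, the 64-byte line buffer, and the max_index parameter are my own assumptions for illustration, not part of the original code. Timestamps outside the table are simply skipped here; you may prefer to report them.

```c
#include <stdio.h>
#include <stdlib.h>

#define START_TIME 1645491600LL   /* same value as start_time in the question */
#define SECONDS_PER_HOUR 3600LL

struct hour_freq {
    int hours;  /* hours elapsed since START_TIME */
    int freq;   /* number of timestamps seen in that hour */
};

/* Hypothetical helper: map a timestamp to its hourly slot.
   Returns -1 when the timestamp falls outside the table. */
long hour_index(long long time_stamp, long max_index)
{
    if (time_stamp < START_TIME)
        return -1;
    long long idx = (time_stamp - START_TIME) / SECONDS_PER_HOUR;
    return idx <= max_index ? (long)idx : -1;
}

/* Hypothetical helper: count every timestamp read from `in`, one per line.
   `table` must hold max_index + 1 zero-initialized entries.
   Returns the number of lines that were counted. */
long count_timestamps(FILE *in, struct hour_freq *table, long max_index)
{
    char buffer[64];
    long counted = 0;

    while (fgets(buffer, sizeof buffer, in) != NULL) {
        char *end;
        long long time_stamp = strtoll(buffer, &end, 10);
        if (end == buffer)
            continue;               /* no digits on this line */
        long idx = hour_index(time_stamp, max_index);
        if (idx < 0)
            continue;               /* out of range, skip */
        table[idx].hours = (int)idx;
        table[idx].freq += 1;
        counted++;
    }
    return counted;
}
```

The table would be allocated with calloc(max_index + 1, sizeof(struct hour_freq)) so that every freq starts at zero, and the order of the input lines does not matter.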

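If time_stamp can reach INT_MAX (2³¹ - 1), i.e. 2147483647, then max_index would be

max_index = (2147483647 - 1645491600) / 3600 = 139,442

With 8-byte entries that is roughly 1.06 MiB, well within a 50 MiB memory limit. With 45 MiB, you could index time_stamp values up to 22,879,155,600.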

(( 22879155600 − 1645491600 ) ÷ 3600 ) × 8 = 47,185,920 octets = 45 MiB
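
The arithmetic above can be double-checked with a few lines of C. This is only a sketch: max_index_for and table_bytes are hypothetical helper names, and the 8-byte entry size assumes a 4-byte int with no padding, which is what the formula above uses.

```c
#define START_TIME 1645491600LL   /* same value as start_time in the question */
#define SECONDS_PER_HOUR 3600LL

struct hour_freq {
    int hours;
    int freq;
};

/* Hypothetical helper: largest hourly index needed for a given timestamp. */
long long max_index_for(long long max_time_stamp)
{
    return (max_time_stamp - START_TIME) / SECONDS_PER_HOUR;
}

/* Hypothetical helper: bytes used by the table, following the formula
   above (index multiplied by the entry size). */
long long table_bytes(long long max_time_stamp)
{
    return max_index_for(max_time_stamp) * (long long)sizeof(struct hour_freq);
}
```

Note that timestamps beyond INT_MAX need a 64-bit type such as long long, since long is only 32 bits wide on some platforms.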

To find the frequency of a time_stamp at hour granularity, compute its time_stamp_index and look it up in the array.
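
A lookup along those lines might be sketched as follows. The function name lookup_freq and the choice to return 0 for timestamps outside the table are assumptions on my part, not something the original code prescribes.

```c
#define START_TIME 1645491600LL   /* same value as start_time in the question */
#define SECONDS_PER_HOUR 3600LL

struct hour_freq {
    int hours;
    int freq;
};

/* Hypothetical helper: return how many timestamps were counted in the
   hour containing time_stamp, or 0 if it lies outside the table. */
int lookup_freq(const struct hour_freq *table, long long max_index,
                long long time_stamp)
{
    if (time_stamp < START_TIME)
        return 0;
    long long idx = (time_stamp - START_TIME) / SECONDS_PER_HOUR;
    if (idx > max_index)
        return 0;
    return table[idx].freq;
}
```

Because the index already encodes the hour, the lookup stays O(1), just like the original one-slot-per-second array.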


c

memory

arrays

file