优化C#中一个Huge List<T>的过程
Optimising the process of a Huge List<T> in C#
我正在研究一种调度算法,该算法根据以下限制将 generates/assigns 个时隙分配给收件人列表:
- 每分钟最多收件人
- 每小时最大收件人数
假设投递开始时间是 2018-10-17 9:00 上午,我们有 19 个收件人,每分钟最多 5 个,每小时最多 10 个,所以输出应该是:
- 5 位收件人将安排在 2018-10-17 9:00 AM
- 5 位收件人将安排在 2018-10-17 9:01 AM
- 5 位收件人将安排在 2018-10-17 10:00 AM
- 4 位收件人将安排在 2018-10-17 10:01 AM
该算法非常准确,但其工作方式如下:
- 首先,它会生成一个时隙列表或时间-windows,这些列表准确地符合编号。收件人数量基于我之前提到的限制。
- 然后,我将移动每个 set/group 或收件人的时隙列表中可用的任何内容。
- 在时隙列表中,我添加了一个计数器,它会随着添加到其中的每个收件人而递增,这样我就可以跟踪号码。添加到每个时隙的每个收件人的数量,以遵守每分钟/小时的最大限制。
在此代码段中简化了之前的过程 - 我正在使用 While 循环进行迭代,在我有 500K 收件人的情况下,这需要 28 分钟才能完成!
我尝试使用 Parallel.ForEach 但我不知道在这种情况下如何实现它。
DateTime DeliveryStart = DateTime.Now;
//This list has DateTime: Time-windows values starting from DeliveryStart to the Max value of the time needed to schedule the Recipients
var listOfTimeSlots = new List<Tuple<DateTime, bool, int>>();
//List of Recipients with Two types of data: DateTime to tell when its scheduled and int value refers to the Recipient's ID
var ListOfRecipients = new List<Tuple<DateTime, int>>();
List<Tuple<int, DateTime>> RecipientsWithTimeSlots= new List<Tuple<int, DateTime>>();
int noOfRecipients = ListOfRecipients.Count;
int Prevhour = 0, _AddedPerHour = 0, Prevday = 0;
// Scheduling restrictions
int _MaxPerHour = 5400, _MaxPerMinute = 90;
int i = 0;
int indexStart = 0;
// ...
// ...
// Code to fill listOfTimeSlots ListOfRecipients with Data
while (noOfRecipients > 0)
{
var TimeStamp = listOfTimeSlots[i];
int hour = TimeStamp.Item1.Hour;
int day = TimeStamp.Item1.Day;
if (Prevhour == 0)
{
Prevhour = hour;
Prevday = day;
}
if (Prevhour != hour)
{
Prevhour = hour;
_AddedPerHour = 0;
}
if (_AddedPerHour >= _MaxPerHour)
{
var tmpItem = listOfTimeSlots.Where(l => l.Item1.Hour == hour && l.Item1.Day == day).LastOrDefault();
int indexOfNextItem = listOfTimeSlots.LastIndexOf(tmpItem) + 1;
i = indexOfNextItem;
_AddedPerHour = 0;
continue;
}
else
{
int endIndex;
endIndex = _MaxPerMinute > noOfRecipients ? noOfRecipients : _MaxPerMinute;
if (endIndex > Math.Abs(_AddedPerHour - _MaxPerHour))
endIndex = Math.Abs(_AddedPerHour - _MaxPerHour);
var RecipientsToIteratePerMinute = ListOfRecipients.GetRange(indexStart, endIndex);
foreach (var item in RecipientsToIteratePerMinute)
{
RecipientsWithTimeSlots.Add(new Tuple<int, DateTime>(item.Item2, TimeStamp.Item1));
listOfTimeSlots[i] = new Tuple<DateTime, bool, int>(TimeStamp.Item1, true, listOfTimeSlots[i].Item3 + 1);
_AddedPerHour++;
}
indexStart += endIndex;
noOfRecipients -= endIndex;
i++;
}
}
我简化了这里的代码,为了不让它太难理解,我只希望它能加速 while 循环或用 Parallel.ForEach 替换它。
虽然循环从未被简化过,但这就是它的工作原理\
如有任何帮助或建议,我们将不胜感激。
这是一种不同的方法。它首先创建 id 组,然后根据要求为它们分配日期。
首先,一个 class 来表示组(避免使用元组):
public class RecipientGroup
{
public RecipientGroup(DateTime scheduledDateTime, IEnumerable<int> recipients)
{
ScheduledDateTime= scheduledDateTime;
Recipients = recipients;
}
public DateTime ScheduledDateTime { get; private set; }
public IEnumerable<int> Recipients { get; private set; }
public override string ToString()
{
return string.Format($"Date: {ScheduledDateTime.ToShortDateString()} {ScheduledDateTime.ToLongTimeString()}, count: {Recipients.Count()}");
}
}
然后 class 遍历组。稍后你会明白为什么需要这样做:
public class GroupIterator
{
public GroupIterator(DateTime scheduledDateTime)
{
ScheduledDateTime = scheduledDateTime;
}
public DateTime ScheduledDateTime { get; set; }
public int Count { get; set; }
}
现在,代码:
DateTime DeliveryStart = new DateTime(2018, 10, 17);
//List of Recipients (fake populate function)
IEnumerable<int> allRecipients = PopulateRecipients();
// Scheduling restrictions
int maxPerMinute = 90;
int maxPerHour = 270;
//Creates groups broken down by the max per minute.
var groupsPerMinute = allRecipients
.Select((s, i) => new { Value = s, Index = i })
.GroupBy(x => x.Index / maxPerMinute)
.Select(group => group.Select(x => x.Value).ToArray());
//This will be the resulting groups
var deliveryDateGroups = new List<RecipientGroup>();
//Perform an aggregate run on the groups using the iterator
groupsPerMinute.Aggregate(new GroupIterator(DeliveryStart), (iterator, ids) =>
{
var nextBreak = iterator.Count + ids.Count();
if (nextBreak >= maxPerHour)
{
//Will go over limit, split
var difference = nextBreak-maxPerHour;
var groupSize = ids.Count() - difference;
//This group completes the batch
var group = new RecipientGroup(iterator.ScheduledDateTime, ids.Take(groupSize));
deliveryDateGroups.Add(group);
var newDate = iterator.ScheduledDateTime.AddHours(1).AddMinutes(-iterator.ScheduledDateTime.Minute);
//Add new group with remaining recipients.
var stragglers = new RecipientGroup(newDate, ids.Skip(groupSize));
deliveryDateGroups.Add(stragglers);
return new GroupIterator(newDate, difference);
}
else
{
var group = new RecipientGroup(iterator.ScheduledDateTime, ids);
deliveryDateGroups.Add(group);
iterator.ScheduledDateTime = iterator.ScheduledDateTime.AddMinutes(1);
iterator.Count += ids.Count();
return iterator;
}
});
//Output minute group count
Console.WriteLine($"Group count: {deliveryDateGroups.Count}");
//Groups by hour
var byHour = deliveryDateGroups.GroupBy(g => new DateTime(g.ScheduledDateTime.Year, g.ScheduledDateTime.Month, g.ScheduledDateTime.Day, g.ScheduledDateTime.Hour, 0, 0));
Console.WriteLine($"Hour Group count: {byHour.Count()}");
foreach (var group in byHour)
{
Console.WriteLine($"Date: {group.Key.ToShortDateString()} {group.Key.ToShortTimeString()}; Count: {group.Count()}; Recipients: {group.Sum(g => g.Recipients.Count())}");
}
输出:
Group count: 5556
Hour Group count: 1852
Date: 10/17/2018 12:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 1:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 2:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 3:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 4:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 5:00 AM; Count: 3; Recipients: 270
... and so on for all 1852 groups.
这大约需要 3 秒才能完成。
我确信存在极端情况。写的太匆忙了,先想一想吧。
我正在研究一种调度算法,该算法根据以下限制将 generates/assigns 个时隙分配给收件人列表:
- 每分钟最多收件人
- 每小时最大收件人数
假设投递开始时间是 2018-10-17 9:00 上午,我们有 19 个收件人,每分钟最多 5 个,每小时最多 10 个,所以输出应该是:
- 5 位收件人将安排在 2018-10-17 9:00 AM
- 5 位收件人将安排在 2018-10-17 9:01 AM
- 5 位收件人将安排在 2018-10-17 10:00 AM
- 4 位收件人将安排在 2018-10-17 10:01 AM
该算法非常准确,但其工作方式如下:
- 首先,它会生成一个时隙列表或时间-windows,这些列表准确地符合编号。收件人数量基于我之前提到的限制。
- 然后,我将移动每个 set/group 或收件人的时隙列表中可用的任何内容。
- 在时隙列表中,我添加了一个计数器,它会随着添加到其中的每个收件人而递增,这样我就可以跟踪号码。添加到每个时隙的每个收件人的数量,以遵守每分钟/小时的最大限制。
在此代码段中简化了之前的过程 - 我正在使用 While 循环进行迭代,在我有 500K 收件人的情况下,这需要 28 分钟才能完成! 我尝试使用 Parallel.ForEach 但我不知道在这种情况下如何实现它。
DateTime DeliveryStart = DateTime.Now;
//This list has DateTime: Time-windows values starting from DeliveryStart to the Max value of the time needed to schedule the Recipients
var listOfTimeSlots = new List<Tuple<DateTime, bool, int>>();
//List of Recipients with Two types of data: DateTime to tell when its scheduled and int value refers to the Recipient's ID
var ListOfRecipients = new List<Tuple<DateTime, int>>();
List<Tuple<int, DateTime>> RecipientsWithTimeSlots= new List<Tuple<int, DateTime>>();
int noOfRecipients = ListOfRecipients.Count;
int Prevhour = 0, _AddedPerHour = 0, Prevday = 0;
// Scheduling restrictions
int _MaxPerHour = 5400, _MaxPerMinute = 90;
int i = 0;
int indexStart = 0;
// ...
// ...
// Code to fill listOfTimeSlots ListOfRecipients with Data
while (noOfRecipients > 0)
{
var TimeStamp = listOfTimeSlots[i];
int hour = TimeStamp.Item1.Hour;
int day = TimeStamp.Item1.Day;
if (Prevhour == 0)
{
Prevhour = hour;
Prevday = day;
}
if (Prevhour != hour)
{
Prevhour = hour;
_AddedPerHour = 0;
}
if (_AddedPerHour >= _MaxPerHour)
{
var tmpItem = listOfTimeSlots.Where(l => l.Item1.Hour == hour && l.Item1.Day == day).LastOrDefault();
int indexOfNextItem = listOfTimeSlots.LastIndexOf(tmpItem) + 1;
i = indexOfNextItem;
_AddedPerHour = 0;
continue;
}
else
{
int endIndex;
endIndex = _MaxPerMinute > noOfRecipients ? noOfRecipients : _MaxPerMinute;
if (endIndex > Math.Abs(_AddedPerHour - _MaxPerHour))
endIndex = Math.Abs(_AddedPerHour - _MaxPerHour);
var RecipientsToIteratePerMinute = ListOfRecipients.GetRange(indexStart, endIndex);
foreach (var item in RecipientsToIteratePerMinute)
{
RecipientsWithTimeSlots.Add(new Tuple<int, DateTime>(item.Item2, TimeStamp.Item1));
listOfTimeSlots[i] = new Tuple<DateTime, bool, int>(TimeStamp.Item1, true, listOfTimeSlots[i].Item3 + 1);
_AddedPerHour++;
}
indexStart += endIndex;
noOfRecipients -= endIndex;
i++;
}
}
我简化了这里的代码,为了不让它太难理解,我只希望它能加速 while 循环或用 Parallel.ForEach 替换它。
虽然循环从未被简化过,但这就是它的工作原理\
如有任何帮助或建议,我们将不胜感激。
这是一种不同的方法。它首先创建 id 组,然后根据要求为它们分配日期。
首先,一个 class 来表示组(避免使用元组):
public class RecipientGroup
{
public RecipientGroup(DateTime scheduledDateTime, IEnumerable<int> recipients)
{
ScheduledDateTime= scheduledDateTime;
Recipients = recipients;
}
public DateTime ScheduledDateTime { get; private set; }
public IEnumerable<int> Recipients { get; private set; }
public override string ToString()
{
return string.Format($"Date: {ScheduledDateTime.ToShortDateString()} {ScheduledDateTime.ToLongTimeString()}, count: {Recipients.Count()}");
}
}
然后 class 遍历组。稍后你会明白为什么需要这样做:
public class GroupIterator
{
public GroupIterator(DateTime scheduledDateTime)
{
ScheduledDateTime = scheduledDateTime;
}
public DateTime ScheduledDateTime { get; set; }
public int Count { get; set; }
}
现在,代码:
DateTime DeliveryStart = new DateTime(2018, 10, 17);
//List of Recipients (fake populate function)
IEnumerable<int> allRecipients = PopulateRecipients();
// Scheduling restrictions
int maxPerMinute = 90;
int maxPerHour = 270;
//Creates groups broken down by the max per minute.
var groupsPerMinute = allRecipients
.Select((s, i) => new { Value = s, Index = i })
.GroupBy(x => x.Index / maxPerMinute)
.Select(group => group.Select(x => x.Value).ToArray());
//This will be the resulting groups
var deliveryDateGroups = new List<RecipientGroup>();
//Perform an aggregate run on the groups using the iterator
groupsPerMinute.Aggregate(new GroupIterator(DeliveryStart), (iterator, ids) =>
{
var nextBreak = iterator.Count + ids.Count();
if (nextBreak >= maxPerHour)
{
//Will go over limit, split
var difference = nextBreak-maxPerHour;
var groupSize = ids.Count() - difference;
//This group completes the batch
var group = new RecipientGroup(iterator.ScheduledDateTime, ids.Take(groupSize));
deliveryDateGroups.Add(group);
var newDate = iterator.ScheduledDateTime.AddHours(1).AddMinutes(-iterator.ScheduledDateTime.Minute);
//Add new group with remaining recipients.
var stragglers = new RecipientGroup(newDate, ids.Skip(groupSize));
deliveryDateGroups.Add(stragglers);
return new GroupIterator(newDate, difference);
}
else
{
var group = new RecipientGroup(iterator.ScheduledDateTime, ids);
deliveryDateGroups.Add(group);
iterator.ScheduledDateTime = iterator.ScheduledDateTime.AddMinutes(1);
iterator.Count += ids.Count();
return iterator;
}
});
//Output minute group count
Console.WriteLine($"Group count: {deliveryDateGroups.Count}");
//Groups by hour
var byHour = deliveryDateGroups.GroupBy(g => new DateTime(g.ScheduledDateTime.Year, g.ScheduledDateTime.Month, g.ScheduledDateTime.Day, g.ScheduledDateTime.Hour, 0, 0));
Console.WriteLine($"Hour Group count: {byHour.Count()}");
foreach (var group in byHour)
{
Console.WriteLine($"Date: {group.Key.ToShortDateString()} {group.Key.ToShortTimeString()}; Count: {group.Count()}; Recipients: {group.Sum(g => g.Recipients.Count())}");
}
输出:
Group count: 5556
Hour Group count: 1852
Date: 10/17/2018 12:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 1:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 2:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 3:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 4:00 AM; Count: 3; Recipients: 270
Date: 10/17/2018 5:00 AM; Count: 3; Recipients: 270
... and so on for all 1852 groups.
这大约需要 3 秒才能完成。
我确信存在极端情况。写的太匆忙了,先想一想吧。