如何在 TPL 数据流中执行异步操作以获得最佳性能？

Question

我写了下面的方法来批量处理一个巨大的CSV文件。这个想法是从文件中读取一大块行到内存中，然后将这些行块分成固定大小的批次。获得分区后，将这些分区发送到服务器（同步或异步），这可能需要一段时间。

private static void BatchProcess(string filePath, int chunkSize, int batchSize)
{
    List<string> chunk = new List<string>(chunkSize);

    foreach (var line in File.ReadLines(filePath))
    {
        if (chunk.Count == chunk.Capacity)
        {
            // Partition each chunk into smaller chunks grouped on column 1
            var partitions = chunk.GroupBy(c => c.Split(',')[0], (key, g) => g);

            // Further breakdown the chunks into batch size groups
            var groups = partitions.Select(x => x.Select((i, index) =>
                new { i, index }).GroupBy(g => g.index / batchSize, e => e.i));

            // Get batches from groups
            var batches = groups.SelectMany(x => x)
                .Select(y => y.Select(z => z)).ToList();

            // Process all batches asynchronously
            batches.AsParallel().ForAll(async b =>
            {
                WebClient client = new WebClient();
                byte[] bytes = System.Text.Encoding.ASCII
                    .GetBytes(b.SelectMany(x => x).ToString());
                await client.UploadDataTaskAsync("myserver.com", bytes);
            });

            // clear the chunk
            chunk.Clear();
        }

        chunk.Add(line);
    }
}

这段代码似乎不是很有效，原因有两个。

阻塞读取CSV文件的主线程，直到处理完所有分区。
AsParallel 阻塞直到所有任务完成。因此，如果线程池中有更多线程可用于工作，我不会使用它们，因为任务数不受分区数的限制。

batchSize 是固定的，因此无法更改，但 chunkSize 可以针对性能进行调整。我可以选择足够大的 chunkSize，这样创建的批次中没有 >> 系统中没有可用线程，但这仍然意味着 Parallel.ForEach 方法会阻塞，直到所有任务完成。

我如何更改代码，以便利用系统中的所有可用线程来完成工作 w/o 闲置。我在想我可以使用 BlockingCollection 来存储批次，但不确定要给它多大的容量，因为每个块中没有批次是动态的。

关于如何使用 TPL 最大化线程利用率以便系统上的大多数可用线程始终在做事情有什么想法吗？

更新： 这是我到目前为止使用 TPL 数据流得到的结果。这是正确的吗？

private static void UploadData(string filePath, int chunkSize, int batchSize)
{
    var buffer1 = new BatchBlock<string>(chunkSize);
    var buffer2 = new BufferBlock<IEnumerable<string>>();

    var action1 = new ActionBlock<string[]>(t =>
    {
        Console.WriteLine("Got a chunk of lines " + t.Count());

        // Partition each chunk into smaller chunks grouped on column 1
        var partitions = t.GroupBy(c => c.Split(',')[0], (key, g) => g);

        // Further breakdown the chunks into batch size groups
        var groups = partitions.Select(x => x.Select((i, index) =>
            new { i, index }).GroupBy(g => g.index / batchSize, e => e.i));

        // Get batches from groups
        var batches = groups.SelectMany(x => x).Select(y => y.Select(z => z));

        foreach (var batch in batches)
        {
            buffer2.Post(batch);
        }

    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

    buffer1.LinkTo(action1, new DataflowLinkOptions
        { PropagateCompletion = true });

    var action2 = new TransformBlock<IEnumerable<string>,
        IEnumerable<string>>(async b =>
    {
        await ExecuteBatch(b);
        return b;

    }, new ExecutionDataflowBlockOptions
        { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });

    buffer2.LinkTo(action2, new DataflowLinkOptions
        { PropagateCompletion = true });

    var action3 = new ActionBlock<IEnumerable<string>>(b =>
    {
        Console.WriteLine("Finised executing a batch");
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

    action2.LinkTo(action3, new DataflowLinkOptions
        { PropagateCompletion = true });

    Task produceTask = Task.Factory.StartNew(() =>
    {
        foreach (var line in File.ReadLines(filePath))
        {
            buffer1.Post(line);
        }

        //Once marked complete your entire data flow will signal a stop for
        // all new items
        Console.WriteLine("Finished producing");
        buffer1.Complete();
    });

    Task.WaitAll(produceTask);
    Console.WriteLine("Produced complete");

    action1.Completion.Wait();//Will add all the items to buffer2
    Console.WriteLine("Action1 complete");

    buffer2.Complete();//will not get any new items
    action2.Completion.Wait();//Process the batch of 5 and then complete

    Task.Wait(action3.Completion);

    Console.WriteLine("Process complete");
    Console.ReadLine();
}

Answer 1

你很接近，在 TPL 中，数据从一个块流到另一个块，你应该尽量保持这种范式。因此，例如 action1 应该是一个 TransformManyBlock，因为 ActionBlock 是一个 ITargetBlock（即终止块）。

当您在 link 上指定传播完成时，完成事件会自动路由到块中，因此您只需要在最后一个块上执行一个 wait()。

将 is 想象成一个多米诺骨牌链，您在第一个块上调用 complete 它将通过链传播到最后一个块。

你还应该考虑什么是多线程，为什么要多线程；您的示例受到严重 I/O 限制，我不认为绑定一堆线程来等待 I/O 完成是正确的解决方案。

最后，请注意阻塞与否。在您的示例中 buffer1.Post(...) 是不是阻塞调用，您没有理由在任务中使用它。

我编写了以下使用 TPL DataFlow 的示例代码：

static void Main(string[] args)
{
    var filePath = "C:\test.csv";
    var chunkSize = 1024;
    var batchSize = 128;

    var linkCompletion = new DataflowLinkOptions
    {
        PropagateCompletion = true
    };

    var uploadData = new ActionBlock<IEnumerable<string>>(
        async (data) =>
        {
            WebClient client = new WebClient();
            var payload = data.SelectMany(x => x).ToArray();
            byte[] bytes = System.Text.Encoding.ASCII.GetBytes(payload);
            //await client.UploadDataTaskAsync("myserver.com", bytes);
            await Task.Delay(2000);
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded /* Prefer to limit that to some reasonable value */ });

    var lineBuffer = new BatchBlock<string>(chunkSize);

    var splitData = new TransformManyBlock<IEnumerable<string>, IEnumerable<string>>(
        (data) =>
        {
            // Partition each chunk into smaller chunks grouped on column 1
            var partitions = data.GroupBy(c => c.Split(',')[0]);

            // Further beakdown the chunks into batch size groups
            var groups = partitions.Select(x => x.Select((i, index) => new { i, index }).GroupBy(g => g.index / batchSize, e => e.i));

            // Get batches from groups
            var batches = groups.SelectMany(x => x).Select(y => y.Select(z => z));

            // Don't forget to enumerate before returning
            return batches.ToList();
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });
    lineBuffer.LinkTo(splitData, linkCompletion);
    splitData.LinkTo(uploadData, linkCompletion);

    foreach (var line in File.ReadLines(filePath))
    {
        lineBuffer.Post(line);
    }
    lineBuffer.Complete();

    // Wait for uploads to finish
    uploadData.Completion.Wait();
}

如何在 TPL 数据流中执行异步操作以获得最佳性能？

How to do async operations in a TPL Dataflow for best performance?

c#

csv

task-parallel-library

blockingcollection

tpl-dataflow