Reading a file with Apache Arrow ArrowFileReader in .NET

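I am trying to read the contents of an Arrow file, but I could not find a function that gets the actual data out of it. I also could not find any useful examples of reading the data, for example here.

Example C# code for writing and reading: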

// Write
var recordBatch = new Apache.Arrow.RecordBatch.Builder(memoryAllocator)
    .Append("Column A", false, col => col.Int32(array => array.AppendRange(Enumerable.Range(5, 15))))
    .Build();

using (var stream = File.OpenWrite(filePath))
using (var writer = new Apache.Arrow.Ipc.ArrowFileWriter(stream, recordBatch.Schema, true))
{
    await writer.WriteRecordBatchAsync(recordBatch);
    await writer.WriteEndAsync();
}

// Read
var reader = Apache.Arrow.Ipc.ArrowFileReader.FromFile(filePath);
var readBatch = await reader.ReadNextRecordBatchAsync();
var col = readBatch.Column(0);

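While debugging I can see the values in the Values property of col, but I have no way of accessing this information in code. Am I missing something, or is there a different approach to reading the data?

The Apache.Arrow package does not do any compute today. It will read in the file, and you will have access to the raw buffers of data. This is sufficient for a number of intermediate tasks (e.g. services that shuttle data to and from data files, or that aggregate them). So if you want to do a lot of operations on the data, you may want some kind of dataframe library.

One such library is the Microsoft.Data.Analysis library, which has added a DataFrame type that can be created from an Arrow RecordBatch. There is some explanation and examples of the library in this blog post.

I haven't worked with that library much, but I was able to put together a short example of reading an Arrow file and printing the data: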

using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow.Ipc;
using Microsoft.Data.Analysis;

namespace DataframeExperiment
{
    class Program
    {
        static async Task AsyncMain()
        {
            using (var stream = File.OpenRead("/tmp/test.arrow"))
            using (var reader = new ArrowFileReader(stream))
            {
                var recordBatch = await reader.ReadNextRecordBatchAsync();
                Console.WriteLine("Read record batch with {0} column(s)", recordBatch.ColumnCount);
                var dataframe = DataFrame.FromArrowRecordBatch(recordBatch);

                var columnX = dataframe["x"];
                foreach (var value in columnX)
                {
                    Console.WriteLine(value);
                }
            }
        }
        
        static void Main(string[] args)
        {
            AsyncMain().Wait();
        }
    }
}

I created the test file with a small Python script:

import pyarrow as pa
import pyarrow.ipc as ipc

tab = pa.Table.from_pydict({'x': [1, 2, 3], 'y': ['x', 'y', 'z']})
with ipc.RecordBatchFileWriter('/tmp/test.arrow', schema=tab.schema) as writer:
    writer.write_table(tab)

You could probably also create the test file using C# with Apache.Arrow's array builders.
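As a rough sketch of that idea, the following builds the same two columns as the Python script and writes them with ArrowFileWriter. It follows the builder pattern from the question's own code. The Int64 and String builder methods, the NativeMemoryAllocator with 64-byte alignment, and the class name WriteTestFile are assumptions on my part rather than something I have verified against every package version.

```csharp
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;
using Apache.Arrow.Memory;

class WriteTestFile // hypothetical name, for illustration only
{
    static async Task Main()
    {
        // Assumption: a native allocator with 64-byte alignment.
        var memoryAllocator = new NativeMemoryAllocator(alignment: 64);

        // Same data as the Python script: x = [1, 2, 3], y = ["x", "y", "z"].
        var recordBatch = new RecordBatch.Builder(memoryAllocator)
            .Append("x", false, col => col.Int64(array => array.AppendRange(new long[] { 1, 2, 3 })))
            .Append("y", false, col => col.String(array => array.AppendRange(new[] { "x", "y", "z" })))
            .Build();

        // File.Create truncates an existing file, so stale bytes from an
        // earlier, longer file cannot trail the new footer.
        using (var stream = File.Create("/tmp/test.arrow"))
        using (var writer = new ArrowFileWriter(stream, recordBatch.Schema))
        {
            await writer.WriteRecordBatchAsync(recordBatch);
            await writer.WriteEndAsync();
        }
    }
}
```

Column x is written as Int64 to match what pyarrow infers for Python integers, so a reader that casts column 0 to Int64Array should work against either file.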

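Update (using Apache.Arrow directly)

On the other hand, if you want to use Apache.Arrow directly and still get access to the data, you can use the typed arrays (e.g. Int32Array, Int64Array). You will first need to determine the type of your array in some way (through prior knowledge of the schema, as / is style checks, or pattern matching).

Here is an example using Apache.Arrow alone: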

using System;
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

namespace ArrayValuesExperiment
{
    class Program
    {

        static async Task AsyncMain()
        {
            using (var stream = File.OpenRead("/tmp/test.arrow"))
            using (var reader = new ArrowFileReader(stream))
            {
                var recordBatch = await reader.ReadNextRecordBatchAsync();
                // Here I am relying on the fact that I know column
                // 0 is an int64 array.
                var columnX = (Int64Array) recordBatch.Column(0);
                for (int i = 0; i < columnX.Values.Length; i++)
                {
                    Console.WriteLine(columnX.Values[i]);
                }
            }
        }
        
        static void Main(string[] args)
        {
            AsyncMain().Wait();
        }
    }
}
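If the column types are not known in advance, the pattern-matching route mentioned above might look like the sketch below. It handles only the two types in the test file, reads values through GetValue and GetString so that nulls are reported instead of being read from the raw buffer, and loops until the reader returns null. That null-at-end-of-file behaviour is what I understand ArrowFileReader to do, so treat it as an assumption. PrintColumns and PrintColumn are hypothetical names.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

class PrintColumns // hypothetical name, for illustration only
{
    // Dispatch on the concrete array type; only two types are handled here.
    static void PrintColumn(string name, IArrowArray column)
    {
        switch (column)
        {
            case Int64Array ints:
                for (int i = 0; i < ints.Length; i++)
                {
                    long? value = ints.GetValue(i); // null for a null slot
                    string text = value.HasValue ? value.Value.ToString() : "null";
                    Console.WriteLine(name + "[" + i + "] = " + text);
                }
                break;
            case StringArray strings:
                for (int i = 0; i < strings.Length; i++)
                {
                    string text = strings.GetString(i) ?? "null";
                    Console.WriteLine(name + "[" + i + "] = " + text);
                }
                break;
            default:
                Console.WriteLine(name + ": unhandled array type " + column.GetType().Name);
                break;
        }
    }

    static async Task Main()
    {
        using (var stream = File.OpenRead("/tmp/test.arrow"))
        using (var reader = new ArrowFileReader(stream))
        {
            RecordBatch batch;
            // Assumption: the reader returns null once all batches are read.
            while ((batch = await reader.ReadNextRecordBatchAsync()) != null)
            {
                for (int c = 0; c < batch.ColumnCount; c++)
                {
                    PrintColumn(batch.Schema.GetFieldByIndex(c).Name, batch.Column(c));
                }
            }
        }
    }
}
```

Supporting more types means adding more case arms (Int32Array, DoubleArray and so on). For anything beyond a handful of types, a dataframe library is probably less work.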