Reading a file with Apache Arrow ArrowFileReader in .NET
I am trying to read the contents of an Arrow file, but I can't find a function that gets the actual data out of it. I also can't find any useful examples of reading the data. For example here.
C# read/write code example:
// Write
var recordBatch = new Apache.Arrow.RecordBatch.Builder(memoryAllocator)
    .Append("Column A", false, col => col.Int32(array => array.AppendRange(Enumerable.Range(5, 15))))
    .Build();

using (var stream = File.OpenWrite(filePath))
using (var writer = new Apache.Arrow.Ipc.ArrowFileWriter(stream, recordBatch.Schema, true))
{
    await writer.WriteRecordBatchAsync(recordBatch);
    await writer.WriteEndAsync();
}

// Read
var reader = Apache.Arrow.Ipc.ArrowFileReader.FromFile(filePath);
var readBatch = await reader.ReadNextRecordBatchAsync();
var col = readBatch.Column(0);
By debugging the code I can see the values in col's Values property, but I can't access this information in my code.
Am I missing something, or is there a different way to read the data?
The Apache.Arrow package does not do any computation today. It will read the file in, and you will have access to the raw buffers of the data. That is enough for many intermediary tasks (for example, a service that shuttles data to and from data files, or one that aggregates data files). So if you want to do significant manipulation of the data, you probably want some kind of dataframe library.
One such library is Microsoft.Data.Analysis, which has added a DataFrame type that can be created from an Arrow RecordBatch. There is some explanation of the library, with examples, in this blog post.
I haven't used that library much, but I was able to put together a short example that reads an Arrow file and prints the data:
using System;
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow.Ipc;
using Microsoft.Data.Analysis;

namespace DataframeExperiment
{
    class Program
    {
        static async Task AsyncMain()
        {
            using (var stream = File.OpenRead("/tmp/test.arrow"))
            using (var reader = new ArrowFileReader(stream))
            {
                var recordBatch = await reader.ReadNextRecordBatchAsync();
                Console.WriteLine("Read record batch with {0} column(s)", recordBatch.ColumnCount);
                var dataframe = DataFrame.FromArrowRecordBatch(recordBatch);
                var columnX = dataframe["x"];
                foreach (var value in columnX)
                {
                    Console.WriteLine(value);
                }
            }
        }

        static void Main(string[] args)
        {
            AsyncMain().Wait();
        }
    }
}
I created the test file with a small Python script:
import pyarrow as pa
import pyarrow.ipc as ipc

tab = pa.Table.from_pydict({'x': [1, 2, 3], 'y': ['x', 'y', 'z']})
with ipc.RecordBatchFileWriter('/tmp/test.arrow', schema=tab.schema) as writer:
    writer.write_table(tab)
You could probably also create the test file with C#, using Apache.Arrow's array builders.
Update (using Apache.Arrow directly)
On the other hand, if you want to use Apache.Arrow directly and still have access to the data, then you can use the typed arrays (for example Int32Array or Int64Array). You first need to determine the type of the array somehow: through prior knowledge of the schema, as / is style checks, or pattern matching.
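As a sketch of that type-check approach, here is a hypothetical helper (not from the original answer) that pattern-matches on the concrete array type; the two cases shown match the test file's schema and are illustrative, not exhaustive:

```csharp
using System;
using Apache.Arrow;

static class ColumnPrinter
{
    // Print any supported column by pattern-matching its concrete array type.
    public static void Print(IArrowArray array)
    {
        switch (array)
        {
            case Int64Array int64s:
                for (int i = 0; i < int64s.Length; i++)
                    Console.WriteLine(int64s.GetValue(i)); // GetValue returns a nullable long, so null slots are handled
                break;
            case StringArray strings:
                for (int i = 0; i < strings.Length; i++)
                    Console.WriteLine(strings.GetString(i));
                break;
            default:
                Console.WriteLine("Unhandled array type: {0}", array.Data.DataType.Name);
                break;
        }
    }
}
```

Unlike the cast in the example below, this degrades gracefully when a column has a type you did not anticipate.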
Here is an example using Apache.Arrow by itself:
using System;
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

namespace ArrayValuesExperiment
{
    class Program
    {
        static async Task AsyncMain()
        {
            using (var stream = File.OpenRead("/tmp/test.arrow"))
            using (var reader = new ArrowFileReader(stream))
            {
                var recordBatch = await reader.ReadNextRecordBatchAsync();
                // Here I am relying on the fact that I know column
                // 0 is an int64 array.
                var columnX = (Int64Array)recordBatch.Column(0);
                for (int i = 0; i < columnX.Values.Length; i++)
                {
                    Console.WriteLine(columnX.Values[i]);
                }
            }
        }

        static void Main(string[] args)
        {
            AsyncMain().Wait();
        }
    }
}