How do I debug OverflowError: value too large to convert to int32_t?
What I'm trying to do
I'm using PyArrow to read some CSVs and convert them to Parquet. Some of the files I read have many columns and take up a lot of memory (enough to crash the machine running the job), so I'm reading the files in chunks.
Here is the function I use to generate the Arrow tables (snippet for brevity):
from typing import Generator

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import csv as arrow_csv


def generate_arrow_tables(
        input_buffer: pa.lib.Buffer,
        arrow_schema: pa.Schema,
        batch_size: int
) -> Generator[pa.Table, None, None]:
    """
    Generates an Arrow Table from given data.

    :param batch_size: Size of batch streamed from CSV at a time
    :param input_buffer: Takes in an Arrow BufferOutputStream
    :param arrow_schema: Takes in an Arrow Schema
    :return: Returns an Arrow Table
    """
    # Preparing convert options
    co = arrow_csv.ConvertOptions(column_types=arrow_schema, strings_can_be_null=True)
    # Preparing read options
    ro = arrow_csv.ReadOptions(block_size=batch_size)
    # Streaming contents of CSV into batches
    with arrow_csv.open_csv(input_buffer, convert_options=co, read_options=ro) as stream_reader:
        for chunk in stream_reader:
            if chunk is None:
                break
            # Emit batches from generator. Arrow schema is inferred unless explicitly specified
            yield pa.Table.from_batches(batches=[chunk], schema=arrow_schema)
And here is how I use that function to write batches to S3 (snippet for brevity):
GB = 1024 ** 3
# data.size here is the size of the buffer
arrow_tables: Generator[pa.Table, None, None] = generate_arrow_tables(pg_data, arrow_schema, min(data.size, GB ** 10))
# Iterate through generated tables and write to S3
count = 0
for table in arrow_tables:
    count += 1  # Count based on batch size
    # Write keys to S3
    file_name = f'{ARGS.run_id}-{count}.parquet'
    write_to_s3(table, output_path=f"s3://{bucket}/{bucket_prefix}/{file_name}")
What's happening
I'm getting the following error: OverflowError: value too large to convert to int32_t
Here is the stack trace (snippet for brevity):
[2021-08-04 11:26:45,479] {pod_launcher.py:156} INFO - b' ro = arrow_csv.ReadOptions(block_size=batch_size)\n'
[2021-08-04 11:26:45,479] {pod_launcher.py:156} INFO - b' File "pyarrow/_csv.pyx", line 87, in pyarrow._csv.ReadOptions.__init__\n'
[2021-08-04 11:26:45,479] {pod_launcher.py:156} INFO - b' File "pyarrow/_csv.pyx", line 119, in pyarrow._csv.ReadOptions.block_size.__set__\n'
[2021-08-04 11:26:45,479] {pod_launcher.py:156} INFO - b'OverflowError: value too large to convert to int32_t\n'
How can I debug this issue and/or fix it?
I'm happy to provide more information if needed.
If I understand correctly, the third argument to generate_arrow_tables is batch_size, which you pass to the CSV reader as block_size. I'm not sure what the value of data.size is, but you are guarding it with min(data.size, GB ** 10).
A block_size of 10 GB is not going to work. (Note that GB ** 10 is vastly larger than 10 GB, so the min() never actually caps anything and data.size is passed through unchanged; 10 * GB was probably intended, but even that exceeds the limit.) The error you are receiving means the block size does not fit in a signed 32-bit integer (max ~2 GB).
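As a quick check, anything above 2**31 - 1 trips the same error, while a clamped value does not. A minimal sketch, assuming only that block_size must fit in a signed 32-bit integer as your traceback shows (the 3 GiB figure is just for illustration):

import pyarrow.csv as arrow_csv

GB = 1024 ** 3
INT32_MAX = 2 ** 31 - 1  # upper bound for ReadOptions.block_size per the traceback

# This reproduces the error: 3 GiB does not fit in int32_t
try:
    arrow_csv.ReadOptions(block_size=3 * GB)
except OverflowError as exc:
    print(exc)  # value too large to convert to int32_t

# Clamping below the limit constructs ReadOptions without error
ro = arrow_csv.ReadOptions(block_size=min(3 * GB, INT32_MAX))
print(ro.block_size)  # 2147483647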
Beyond that limit, I'm not sure it's a good idea to use a block size much larger than the default (1 MB). I wouldn't expect you to see much of a performance benefit, and you would end up using more RAM than you need.
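So at the call site, capping the value at something modest should both avoid the error and keep memory in check. A sketch reusing the names from your snippet (pg_data, data.size, arrow_schema), with 64 MiB as an arbitrary illustrative cap:

MB = 1024 ** 2

# 64 MiB is an arbitrary example: far below the int32 limit,
# yet still well above the 1 MiB default block size
block_size = min(data.size, 64 * MB)

arrow_tables = generate_arrow_tables(pg_data, arrow_schema, block_size)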