Problem with storing and retrieving very large numbers in parquet format

I am running into a strange problem when storing numbers with many digits (18 digits) in Parquet and retrieving them: I get back different values. Drilling down further, it looks like the problem occurs only when the input list is a mix of None and actual values. When the list has no None values, the values are retrieved as expected.

I do not think this is a display issue. I tried viewing the file with unix commands such as cat, the vi editor, etc., so it does not look like a display problem.

There are 2 parts in the code:

  1. Create a Parquet file from a list containing None and large numbers. This is where the problem occurs. For example: the value 235313013750949476 changes to 235313013750949472, as shown in the output.

  2. Create a Parquet file from a list containing only large numbers and no None values. This works as expected.

Code

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def get_row_list():
    row_list = []

    row_list.append(None)
    row_list.append(235313013750949476)
    row_list.append(None)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    row_list.append(None)
    row_list.append(None)

    return row_list

def get_row_list_with_no_none():
    row_list = []

    row_list.append(235313013750949476)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)

    return row_list

def create_parquet(row_list, col_list, parquet_filename):
    df = pd.DataFrame(row_list, columns=col_list)

    schema_field_list = [('tree_id', pa.int64())]
    pa_schema = pa.schema(schema_field_list)

    table = pa.Table.from_pandas(df, pa_schema)

    pq_writer = pq.ParquetWriter(parquet_filename,
                                 schema=pa_schema)

    pq_writer.write_table(table)
    pq_writer.close()

    print("Parquet file [%s] created" % parquet_filename)

def main():
    col_list = ['tree_id']

    # Row list without any None
    row_list = get_row_list_with_no_none()
    print(row_list)
    create_parquet(row_list, col_list, 'without_none.parquet')

    # Row list with None
    row_list = get_row_list()
    print(row_list)
    create_parquet(row_list, col_list, 'with_none.parquet')

# ==== Main code Execution =====
if __name__ == '__main__':
    main()

[Execution]

python test-parquet.py

[235313013750949476, 135313013750949496, 935313013750949406, 835313013750949456]
Parquet file [without_none.parquet] created
[None, 235313013750949476, None, 135313013750949496, 935313013750949406, 835313013750949456, None, None]
Parquet file [with_none.parquet] created

[Library versions]

pyarrow                  5.0.0
pandas                   1.1.5

python -V
Python 3.6.6

[Testing by reading the Parquet files as Spark DataFrames]

>>> dfwithoutnone = spark.read.parquet("s3://some-bucket/without_none.parquet/")
>>> dfwithoutnone.count()
4
>>> dfwithoutnone.printSchema()
root
 |-- tree_id: long (nullable = true)

>>> dfwithoutnone.show(10, False)
+------------------+                                                            
|tree_id           |
+------------------+
|235313013750949476|
|135313013750949496|
|935313013750949406|
|835313013750949456|
+------------------+

>>> df_with_none = spark.read.parquet("s3://some-bucket/with_none.parquet/")
>>> df_with_none.count()
8                                                                               
>>> df_with_none.printSchema()
root
 |-- tree_id: long (nullable = true)

>>> df_with_none.show(10, False)
+------------------+
|tree_id           |
+------------------+
|null              |
|235313013750949472|
|null              |
|135313013750949504|
|935313013750949376|
|835313013750949504|
|null              |
|null              |
+------------------+

I have searched on Stack Overflow and could not find anything relevant. Could you please point me in the right direction?

Thanks

The problem is not related to Parquet, but to the initial conversion of your row_list to a pandas DataFrame:

row_list = get_row_list()
col_list = ['tree_id']
df = pd.DataFrame(row_list, columns=col_list)

>>> df
        tree_id
0           NaN
1  2.353130e+17
2           NaN
3  1.353130e+17
4  9.353130e+17
5  8.353130e+17
6           NaN
7           NaN

Because of the missing values, pandas creates a float64 column. It is this int -> float conversion that loses precision for such large integers.
Converting the floats back to integers later (when the pyarrow Table is created with a schema that enforces an integer column) then yields slightly different values, as can be seen by doing this manually in Python:

>>> row_list[1]
235313013750949476
>>> df.loc[1, "tree_id"]
2.3531301375094947e+17
>>> int(df.loc[1, "tree_id"])
235313013750949472
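The underlying reason is that float64 has a 53-bit significand, so it can represent every integer exactly only up to 2**53 (about 9.0e15); 18-digit values like the ones above fall well outside that range and get rounded to the nearest representable double. A quick sketch illustrating this:

```python
# float64 represents every integer exactly only up to 2**53.
limit = 2 ** 53
assert float(limit) == limit
assert float(limit + 1) == float(limit)  # 2**53 + 1 is not representable

# An 18-digit value is far above that limit, so the round trip
# through float changes it.
value = 235313013750949476
assert value > limit
assert int(float(value)) != value
```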

一种可能的解决方案是避免使用临时 DataFrame。当然,这将取决于您的确切(真实)用例,但是如果您像上面可重现的示例一样从 python 列表开始,您也可以直接从这个值列表创建一个 pyarrow.Table ( pa.table({"tree_id": row_list}, schema=..) 这将在 Parquet 文件中保留准确的值。