Problem with storing and retrieving very large numbers in parquet format
I am running into a strange issue when storing numbers with many digits (18 digits) to parquet and retrieving them: I get back different values. Drilling down further, the issue only seems to appear when the input list is a mix of None and actual values. When the list has no None values, the values are retrieved as expected.
I don't believe this is a display issue. I tried displaying the values with unix commands such as cat, the vi editor, etc., so it does not look like a display problem.
The code has 2 parts:
1. Create a parquet file from a list containing None and large numbers. This is where the problem occurs. For example, the value 235313013750949476 changes to 235313013750949472, as seen in the output below.
2. Create a parquet file from a list containing only large numbers and no None values. This works as expected.
[Code]
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def get_row_list():
    row_list = []
    row_list.append(None)
    row_list.append(235313013750949476)
    row_list.append(None)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    row_list.append(None)
    row_list.append(None)
    return row_list

def get_row_list_with_no_none():
    row_list = []
    row_list.append(235313013750949476)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    return row_list

def create_parquet(row_list, col_list, parquet_filename):
    df = pd.DataFrame(row_list, columns=col_list)
    schema_field_list = [('tree_id', pa.int64())]
    pa_schema = pa.schema(schema_field_list)
    table = pa.Table.from_pandas(df, pa_schema)
    pq_writer = pq.ParquetWriter(parquet_filename, schema=pa_schema)
    pq_writer.write_table(table)
    pq_writer.close()
    print("Parquet file [%s] created" % parquet_filename)

def main():
    col_list = ['tree_id']
    # Row list without any None
    row_list = get_row_list_with_no_none()
    print(row_list)
    create_parquet(row_list, col_list, 'without_none.parquet')
    # Row list with None
    row_list = get_row_list()
    print(row_list)
    create_parquet(row_list, col_list, 'with_none.parquet')

# ==== Main code execution =====
if __name__ == '__main__':
    main()
[Execution]
python test-parquet.py
[235313013750949476, 135313013750949496, 935313013750949406, 835313013750949456]
Parquet file [without_none.parquet] created
[None, 235313013750949476, None, 135313013750949496, 935313013750949406, 835313013750949456, None, None]
Parquet file [with_none.parquet] created
[Library versions]
pyarrow 5.0.0
pandas 1.1.5
python -V
Python 3.6.6
[Testing by reading the parquet files as Spark DataFrames]
>>> dfwithoutnone = spark.read.parquet("s3://some-bucket/without_none.parquet/")
>>> dfwithoutnone.count()
4
>>> dfwithoutnone.printSchema()
root
|-- tree_id: long (nullable = true)
>>> dfwithoutnone.show(10, False)
+------------------+
|tree_id |
+------------------+
|235313013750949476|
|135313013750949496|
|935313013750949406|
|835313013750949456|
+------------------+
>>> df_with_none = spark.read.parquet("s3://some-bucket/with_none.parquet/")
>>> df_with_none.count()
8
>>> df_with_none.printSchema()
root
|-- tree_id: long (nullable = true)
>>> df_with_none.show(10, False)
+------------------+
|tree_id |
+------------------+
|null |
|235313013750949472|
|null |
|135313013750949504|
|935313013750949376|
|835313013750949504|
|null |
|null |
+------------------+
I searched on Stack Overflow but could not find anything relevant. Could you please point me in the right direction?
Thanks
The problem is not related to Parquet, but to the initial conversion of your row_list
into a pandas DataFrame:
row_list = get_row_list()
col_list = ['tree_id']
df = pd.DataFrame(row_list, columns=col_list)
>>> df
tree_id
0 NaN
1 2.353130e+17
2 NaN
3 1.353130e+17
4 9.353130e+17
5 8.353130e+17
6 NaN
7 NaN
Because of the missing values, pandas creates a float64 column (NaN, pandas' default missing-value marker, is a float, so an integer column with missing data gets cast to float). It is this int -> float conversion that loses precision for such large integers.
Converting the floats back to integers later (when creating the pyarrow Table with a schema that forces an integer column) then yields slightly different values, as can be seen by doing the conversion manually in Python:
>>> row_list[1]
235313013750949476
>>> df.loc[1, "tree_id"]
2.3531301375094947e+17
>>> int(df.loc[1, "tree_id"])
235313013750949472
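To spell out why the precision is lost: float64 has a 53-bit significand, so integers above 2**53 can no longer all be represented exactly, and these 18-digit values need about 58 bits. A quick check in the Python REPL (plain built-ins, nothing beyond your example values) confirms this:
>>> 2 ** 53
9007199254740992
>>> (235313013750949476).bit_length()
58
>>> float(235313013750949476) == 235313013750949476
False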
One possible solution is to avoid the temporary DataFrame. This will of course depend on your exact (real) use case, but if you start from a Python list as in the reproducible example above, you can also create a pyarrow.Table directly from this list of values with pa.table({"tree_id": row_list}, schema=..).
This will preserve the exact values in the Parquet file.
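For completeness, here is a minimal sketch of that suggestion, end to end (same column name and values as in your question; the file name is just an example):

import pyarrow as pa
import pyarrow.parquet as pq

row_list = [None, 235313013750949476, None, 135313013750949496,
            935313013750949406, 835313013750949456, None, None]

# Build the Table straight from the Python list, so the values never
# pass through a float64 pandas column.
pa_schema = pa.schema([('tree_id', pa.int64())])
table = pa.table({'tree_id': row_list}, schema=pa_schema)
pq.write_table(table, 'with_none.parquet')

# Reading it back returns the exact integers (None stays null):
print(pq.read_table('with_none.parquet').column('tree_id').to_pylist())

If your real pipeline does need a pandas DataFrame, another option (assuming a pandas version with nullable integer support, which your 1.1.5 has) is the nullable Int64 extension dtype, which keeps the column as integers instead of casting to float64:

df = pd.DataFrame({'tree_id': pd.array(row_list, dtype='Int64')})

pa.Table.from_pandas understands this extension dtype, so the exact values survive the round trip.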