Does conversion from Arrow format to pandas DataFrame duplicate data on the heap?
I am trying to track down the source of high memory usage when reading from an arrow file and converting to a pandas dataframe. When I look at the heap, the pandas dataframe appears to be almost the same size as the numpy arrays. Example heap output using guppy's hpy().heap():
Partition of a set of 351136 objects. Total size = 20112096840 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 121 0 9939601034 49 9939601034 49 numpy.ndarray
1 1 0 9939585700 49 19879186734 99 pandas.core.frame.DataFrame
2 1 0 185786680 1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex
I wrote a test script to better illustrate what I mean; although I use a different method for the conversion, the concept is the same:
import numpy as np
import pandas as pd
import pyarrow as pa
from pyarrow import feather
from guppy import hpy
import psutil
import os
import time
DATA_FILE = 'test.arrow'
process = psutil.Process(os.getpid())
def setup():
    np.random.seed(0)
    df = pd.DataFrame(np.random.randint(0, 100, size=(7196546, 57)), columns=list([f'{i}' for i in range(57)]))
    mem_size = process.memory_info().rss / 1e9
    print(f'before feather {mem_size}gb: \n{hpy().heap()}')
    df.to_feather(DATA_FILE)
    time.sleep(5)
    mem_size = process.memory_info().rss / 1e9
    print(f'after writing to feather {mem_size}gb: \n{hpy().heap()}')
    print(f'wrote {DATA_FILE}')
    import sys
    sys.exit()

def foo():
    mem_size = process.memory_info().rss / 1e9
    path = DATA_FILE
    print(f'before reading table {mem_size}gb: \n{hpy().heap()}')
    feather_table = feather.read_table(path)
    mem_size = process.memory_info().rss / 1e9
    print(f'after reading table {mem_size}gb: \n{hpy().heap()}')
    df = feather_table.to_pandas()
    mem_size = process.memory_info().rss / 1e9
    print(f'after converting to pandas {mem_size}gb: \n{hpy().heap()}')
    return df

if __name__ == "__main__":
    # setup()
    df = foo()
    time.sleep(5)
    mem_size = process.memory_info().rss / 1e9
    print(f'final heap {mem_size}gb: \n{hpy().heap()}')
setup() needs to be called before foo(), to generate the test file.
Output (from setup):
before feather 3.374010368gb:
Partition of a set of 229931 objects. Total size = 3313572857 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 3281625136 99 3281625136 99 pandas.core.frame.DataFrame
1 59491 26 9902952 0 3291528088 99 str
2 64105 28 5450160 0 3296978248 99 tuple
3 30157 13 2339796 0 3299318044 100 bytes
4 15221 7 2203888 0 3301521932 100 types.CodeType
5 14449 6 2080656 0 3303602588 100 function
6 6674 3 2018224 0 3305620812 100 dict (no owner)
7 1860 1 1539768 0 3307160580 100 type
8 630 0 1158616 0 3308319196 100 dict of module
9 1860 1 1078064 0 3309397260 100 dict of type
<616 more rows. Type e.g. '_.more' to view.>
after writing to feather 3.40015104gb:
Partition of a set of 230564 objects. Total size = 6595283738 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 57 0 3281634096 50 3281634096 50 pandas.core.series.Series
1 1 0 3281625136 50 6563259232 100 pandas.core.frame.DataFrame
2 59548 26 9905849 0 6573165081 100 str
3 64073 28 5445176 0 6578610257 100 tuple
4 30153 13 2339608 0 6580949865 100 bytes
5 15219 7 2203600 0 6583153465 100 types.CodeType
6 6845 3 2064024 0 6585217489 100 dict (no owner)
7 14304 6 2059776 0 6587277265 100 function
8 1860 1 1540224 0 6588817489 100 type
9 630 0 1158616 0 6589976105 100 dict of module
<627 more rows. Type e.g. '_.more' to view.>
wrote test.arrow
Output (a normal run, without setup):
before reading table 0.092004352gb:
Partition of a set of 229908 objects. Total size = 31941164 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 59491 26 9902952 31 9902952 31 str
1 64104 28 5450096 17 15353048 48 tuple
2 30157 13 2339788 7 17692836 55 bytes
3 15221 7 2203888 7 19896724 62 types.CodeType
4 14449 6 2080656 7 21977380 69 function
5 6669 3 2016984 6 23994364 75 dict (no owner)
6 1860 1 1539768 5 25534132 80 type
7 630 0 1158616 4 26692748 84 dict of module
8 1860 1 1078064 3 27770812 87 dict of type
9 1979 1 490792 2 28261604 88 dict of function
<605 more rows. Type e.g. '_.more' to view.>
after reading table 3.512406016gb:
Partition of a set of 229383 objects. Total size = 3313510008 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 3281625032 99 3281625032 99 pyarrow.lib.Table
1 59491 26 9902952 0 3291527984 99 str
2 63952 28 5436848 0 3296964832 100 tuple
3 30153 13 2339600 0 3299304432 100 bytes
4 15219 7 2203600 0 3301508032 100 types.CodeType
5 14303 6 2059632 0 3303567664 100 function
6 6669 3 2016984 0 3305584648 100 dict (no owner)
7 1860 1 1539768 0 3307124416 100 type
8 630 0 1158616 0 3308283032 100 dict of module
9 1860 1 1078064 0 3309361096 100 dict of type
<604 more rows. Type e.g. '_.more' to view.>
after converting to pandas 6.797561856gb:
Partition of a set of 229432 objects. Total size = 6595149289 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 3281625136 50 3281625136 50 pandas.core.frame.DataFrame
1 1 0 3281625032 50 6563250168 100 pyarrow.lib.Table
2 59491 26 9902952 0 6573153120 100 str
3 63965 28 5437856 0 6578590976 100 tuple
4 30153 13 2339600 0 6580930576 100 bytes
5 15219 7 2203600 0 6583134176 100 types.CodeType
6 14303 6 2059632 0 6585193808 100 function
7 6673 3 2020016 0 6587213824 100 dict (no owner)
8 1860 1 1540264 0 6588754088 100 type
9 630 0 1158616 0 6589912704 100 dict of module
<618 more rows. Type e.g. '_.more' to view.>
final heap 6.79968768gb:
Partition of a set of 230570 objects. Total size = 6595283554 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 57 0 3281634096 50 3281634096 50 pandas.core.series.Series
1 1 0 3281625136 50 6563259232 100 pandas.core.frame.DataFrame
2 59538 26 9905349 0 6573164581 100 str
3 64080 28 5445672 0 6578610253 100 tuple
4 30153 13 2339600 0 6580949853 100 bytes
5 15219 7 2203600 0 6583153453 100 types.CodeType
6 6844 3 2062552 0 6585216005 100 dict (no owner)
7 14304 6 2059776 0 6587275781 100 function
8 1860 1 1540264 0 6588816045 100 type
9 630 0 1159152 0 6589975197 100 dict of module
<627 more rows. Type e.g. '_.more' to view.>
The dataframe seems to have a copy on the heap, represented by pd.Series objects. The copy is not there when the dataframe is first created, only once it has been written to the arrow/feather file. And once we read that file back, the Series show up again and together match the expected size of the dataframe.
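(As a cross-check, not part of the script above: pandas' own accounting attributes the underlying block memory to the columns exactly once, regardless of how many Series views a heap profiler happens to see.)

df = foo()
# memory_usage counts each column's slice of the underlying blocks once;
# for the 7196546 x 57 int64 frame this sums to ~3.28 GB, i.e. one copy,
# even while guppy lists both the DataFrame and 57 Series at that size.
print(df.memory_usage(deep=True).sum() / 1e9)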
Does conversion from arrow format to pandas dataframe duplicate data on the heap?
The documentation explains what is going on quite well: https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy
In your case the data really is copied. In some situations you can get away without copying the data.
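The linked page also describes Table.to_pandas() options that keep the peak closer to a single copy during the conversion. A minimal sketch along those lines (split_blocks and self_destruct are the parameters the docs discuss; test.arrow is the file from the question):

from pyarrow import feather

table = feather.read_table('test.arrow')
# split_blocks=True produces one pandas block per column instead of a single
# consolidated 2-D block, and self_destruct=True lets Arrow release each
# column buffer as soon as it has been converted, so the process does not
# have to hold the full Table and the full DataFrame at the same time.
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table  # the Table must not be used again after self_destruct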
But I can't make sense of guppy's output. For example, in the final heap, once the arrow table has gone out of scope, it looks like there are two copies of the data (one in the DataFrame and one in the 57 Series), whereas I would only expect about 3 GB.
pyarrow, pandas and numpy all have different views of the same underlying memory. guppy doesn't seem to recognize this (and I imagine it would be hard for it to do so), so it appears to be double counting. Here is a simple example:
import numpy as np
import os
import psutil
import pyarrow as pa
from guppy import hpy
process = psutil.Process(os.getpid())
# Will consume ~800MB of RAM
x = np.random.rand(100000000)
print(hpy().heap())
# Partition of a set of 98412 objects. Total size = 813400879 bytes.
print(process.memory_info().rss)
# 855588864
# This is a zero-copy operation. Note
# that RSS remains consistent. Both x
# and arr reference the same underlying
# array of doubles.
arr = pa.array(x)
print(hpy().heap())
# Partition of a set of 211452 objects. Total size = 1629410271 bytes.
print(process.memory_info().rss)
# 891699200
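A way to see the double counting directly (my addition, not part of the example above) is to compare the numpy data pointer with the address of the Arrow values buffer; they should point at the same memory:

# x.ctypes.data is the address of the numpy double buffer.
# arr.buffers() returns [validity bitmap, values] for a DoubleArray,
# and the validity buffer is None here because there are no nulls.
print(x.ctypes.data)
print(arr.buffers()[1].address)
# The two addresses match, so the ~1.6 GB guppy reports is the same
# ~800 MB buffer counted twice.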