Does conversion from arrow format to pandas dataframe duplicate data on the heap?

I am trying to figure out what is causing the high memory usage when reading from an arrow file and converting to a pandas dataframe. When I look at the heap, the pandas DataFrame appears to be almost the same size as the numpy arrays. Example heap output using guppy's hpy().heap():
Partition of a set of 351136 objects. Total size = 20112096840 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0    121   0 9939601034  49 9939601034  49 numpy.ndarray
     1      1   0 9939585700  49 19879186734  99 pandas.core.frame.DataFrame
     2      1   0 185786680   1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex

I wrote a test script to better illustrate what I am talking about; although I am converting through a different method there, the concept is the same:

import numpy as np
import pandas as pd
import pyarrow as pa
from pyarrow import feather
from guppy import hpy
import psutil
import os
import time

DATA_FILE = 'test.arrow'
process = psutil.Process(os.getpid()) 

def setup():
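  # Build a large random DataFrame and write it to DATA_FILE as feather,
  # printing heap/RSS before and after, then exit.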
  np.random.seed(0)
  df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)), columns=list([f'{i}' for i in range(57)]))
  mem_size = process.memory_info().rss / 1e9
  print(f'before feather {mem_size}gb: \n{hpy().heap()}')
  df.to_feather(DATA_FILE)
  time.sleep(5)
  mem_size = process.memory_info().rss / 1e9
  print(f'after writing to feather {mem_size}gb: \n{hpy().heap()}')
  print(f'wrote {DATA_FILE}')
  import sys
  sys.exit()

def foo():
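  # Read DATA_FILE back as an Arrow table and convert it to pandas,
  # printing heap/RSS at each step.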
  mem_size = process.memory_info().rss / 1e9
  path = DATA_FILE
  print(f'before reading table {mem_size}gb: \n{hpy().heap()}')
  feather_table = feather.read_table(path)
  mem_size = process.memory_info().rss / 1e9
  print(f'after reading table {mem_size}gb: \n{hpy().heap()}')
  df = feather_table.to_pandas()
  mem_size = process.memory_info().rss / 1e9
  print(f'after converting to pandas {mem_size}gb: \n{hpy().heap()}')
  return df

if __name__ == "__main__":
  #setup()
  df = foo()
  time.sleep(5)
  mem_size = process.memory_info().rss / 1e9
  print(f'final heap {mem_size}gb: \n{hpy().heap()}')

setup() needs to be called before foo().

Output (from setup):

before feather 3.374010368gb:
Partition of a set of 229931 objects. Total size = 3313572857 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   0 3281625136  99 3281625136  99 pandas.core.frame.DataFrame
     1  59491  26  9902952   0 3291528088  99 str
     2  64105  28  5450160   0 3296978248  99 tuple
     3  30157  13  2339796   0 3299318044 100 bytes
     4  15221   7  2203888   0 3301521932 100 types.CodeType
     5  14449   6  2080656   0 3303602588 100 function
     6   6674   3  2018224   0 3305620812 100 dict (no owner)
     7   1860   1  1539768   0 3307160580 100 type
     8    630   0  1158616   0 3308319196 100 dict of module
     9   1860   1  1078064   0 3309397260 100 dict of type
<616 more rows. Type e.g. '_.more' to view.>
after writing to feather 3.40015104gb:
Partition of a set of 230564 objects. Total size = 6595283738 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0     57   0 3281634096  50 3281634096  50 pandas.core.series.Series
     1      1   0 3281625136  50 6563259232 100 pandas.core.frame.DataFrame
     2  59548  26  9905849   0 6573165081 100 str
     3  64073  28  5445176   0 6578610257 100 tuple
     4  30153  13  2339608   0 6580949865 100 bytes
     5  15219   7  2203600   0 6583153465 100 types.CodeType
     6   6845   3  2064024   0 6585217489 100 dict (no owner)
     7  14304   6  2059776   0 6587277265 100 function
     8   1860   1  1540224   0 6588817489 100 type
     9    630   0  1158616   0 6589976105 100 dict of module
<627 more rows. Type e.g. '_.more' to view.>
wrote test.arrow

Output (normal run, without setup):

before reading table 0.092004352gb:
Partition of a set of 229908 objects. Total size = 31941164 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  59491  26  9902952  31   9902952  31 str
     1  64104  28  5450096  17  15353048  48 tuple
     2  30157  13  2339788   7  17692836  55 bytes
     3  15221   7  2203888   7  19896724  62 types.CodeType
     4  14449   6  2080656   7  21977380  69 function
     5   6669   3  2016984   6  23994364  75 dict (no owner)
     6   1860   1  1539768   5  25534132  80 type
     7    630   0  1158616   4  26692748  84 dict of module
     8   1860   1  1078064   3  27770812  87 dict of type
     9   1979   1   490792   2  28261604  88 dict of function
<605 more rows. Type e.g. '_.more' to view.>
after reading table 3.512406016gb:
Partition of a set of 229383 objects. Total size = 3313510008 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   0 3281625032  99 3281625032  99 pyarrow.lib.Table
     1  59491  26  9902952   0 3291527984  99 str
     2  63952  28  5436848   0 3296964832 100 tuple
     3  30153  13  2339600   0 3299304432 100 bytes
     4  15219   7  2203600   0 3301508032 100 types.CodeType
     5  14303   6  2059632   0 3303567664 100 function
     6   6669   3  2016984   0 3305584648 100 dict (no owner)
     7   1860   1  1539768   0 3307124416 100 type
     8    630   0  1158616   0 3308283032 100 dict of module
     9   1860   1  1078064   0 3309361096 100 dict of type
<604 more rows. Type e.g. '_.more' to view.>
after converting to pandas 6.797561856gb:
Partition of a set of 229432 objects. Total size = 6595149289 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   0 3281625136  50 3281625136  50 pandas.core.frame.DataFrame
     1      1   0 3281625032  50 6563250168 100 pyarrow.lib.Table
     2  59491  26  9902952   0 6573153120 100 str
     3  63965  28  5437856   0 6578590976 100 tuple
     4  30153  13  2339600   0 6580930576 100 bytes
     5  15219   7  2203600   0 6583134176 100 types.CodeType
     6  14303   6  2059632   0 6585193808 100 function
     7   6673   3  2020016   0 6587213824 100 dict (no owner)
     8   1860   1  1540264   0 6588754088 100 type
     9    630   0  1158616   0 6589912704 100 dict of module
<618 more rows. Type e.g. '_.more' to view.>
final heap 6.79968768gb:
Partition of a set of 230570 objects. Total size = 6595283554 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0     57   0 3281634096  50 3281634096  50 pandas.core.series.Series
     1      1   0 3281625136  50 6563259232 100 pandas.core.frame.DataFrame
     2  59538  26  9905349   0 6573164581 100 str
     3  64080  28  5445672   0 6578610253 100 tuple
     4  30153  13  2339600   0 6580949853 100 bytes
     5  15219   7  2203600   0 6583153453 100 types.CodeType
     6   6844   3  2062552   0 6585216005 100 dict (no owner)
     7  14304   6  2059776   0 6587275781 100 function
     8   1860   1  1540264   0 6588816045 100 type
     9    630   0  1159152   0 6589975197 100 dict of module
<627 more rows. Type e.g. '_.more' to view.>

The dataframe seems to have a copy on the heap, represented as pd.Series. It is not there when the dataframe is first created, only once it has been written to the arrow/feather file. And once we read that file back, the Series come back and are, as expected, the same size as the dataframe.
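
As a sanity check (a minimal sketch, not part of the script above; whether a column is a view depends on the pandas version and copy-on-write settings), np.shares_memory can show whether such Series own their own buffers or merely view the DataFrame's block:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(1000, 5)),
                  columns=[f'{i}' for i in range(5)])

# Selecting a column typically yields a Series that views the frame's
# consolidated 2D block rather than owning a separate buffer.
col = df['0']

# For a single-dtype frame, df.values is usually a no-copy (transposed)
# view of that same block, so the column's memory overlaps it.
print(np.shares_memory(col.to_numpy(), df.values))  # expected True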

Does conversion from arrow format to pandas dataframe duplicate data on the heap?

The documentation explains what is happening quite well: https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy

In your case the data really is copied. In some situations you can get away without copying the data.
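
That page also describes options on Table.to_pandas that can keep the peak lower when a copy is unavoidable; a minimal sketch based on what it documents (split_blocks and self_destruct behave as described there, and the table must not be used afterwards):

import pyarrow.feather as feather

table = feather.read_table('test.arrow')
# split_blocks=True keeps each column in its own block instead of one
# consolidated 2D array, and self_destruct=True releases the table's
# buffers as columns are converted, so peak memory stays closer to a
# single copy of the data. The table is unusable after this call.
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table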

But I am having trouble making sense of guppy's output. For example, in the final heap, when the arrow table has gone out of scope, it looks as if there are two copies of the data (one in the DataFrame and one in the 57 Series), whereas I would actually only expect about 3 GB.

pyarrow, pandas, and numpy all have different views of the same underlying memory. Guppy does not seem able to recognize this (and I imagine it would be hard for it to do so), so it appears to be double counting. Here is a simple example:

import numpy as np
import os
import psutil
import pyarrow as pa
from guppy import hpy

process = psutil.Process(os.getpid())

# Will consume ~800MB of RAM
x = np.random.rand(100000000)
print(hpy().heap())
# Partition of a set of 98412 objects. Total size = 813400879 bytes.
print(process.memory_info().rss)
# 855588864

# This is a zero-copy operation. Note that RSS remains consistent.
# Both x and arr reference the same underlying array of doubles.
arr = pa.array(x)
print(hpy().heap())
# Partition of a set of 211452 objects. Total size = 1629410271 bytes.
print(process.memory_info().rss)
# 891699200
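
To convince yourself that x and arr really are backed by one buffer (a small follow-up to the example above, assuming the array has no nulls so the conversion stays zero-copy):

# to_numpy(zero_copy_only=True) hands back a numpy view over the arrow
# buffer, which is the same memory that backs x, even though guppy
# counted the ndarray and the arrow array separately above.
view = arr.to_numpy(zero_copy_only=True)
print(np.shares_memory(x, view))
# True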