Why is the memory usage of a Series about 1.5x that of a DataFrame holding the same data?

The code is as follows:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: from itertools import product
In [4]: index = list(map(''.join, product(*['ABCDEFGH']*4)))
In [5]: columns = list(map(''.join, product(*['xyzuvw']*3)))

In [6]: df = pd.DataFrame(np.random.randn(len(index), len(columns)), index=index, columns=columns)
In [7]: ser = df.stack()
In [8]: df.memory_usage().sum()
Out[8]: 7274496

In [10]: ser.memory_usage()
Out[10]: 10651360

In [11]: ser.memory_usage() / df.memory_usage().sum()
Out[11]: 1.4642059051238738

In [12]: df.to_hdf('f:/f1.h5', 'df')
In [13]: ser.to_hdf('f:/f2.h5', 'ser')
In [14]: import os

In [15]: os.stat('f:/f2.h5').st_size / os.stat('f:/f1.h5').st_size
Out[15]: 1.498167701758398
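For reference, the raw float64 payload is the same in both cases and can be computed by hand; the small sanity check below (not from the original post) shows it nearly matches df.memory_usage().sum():

```python
# Raw data size, independent of any index or columns:
rows, cols = 8 ** 4, 6 ** 3     # 4096 index labels x 216 column labels
raw_bytes = rows * cols * 8     # 884736 float64 values, 8 bytes each
print(raw_bytes)                # 7077888 -- close to df's 7274496 above
```

So the DataFrame carries only a small overhead beyond the values themselves; the extra ~3.4 MB in the Series must come from somewhere else.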

And the pandas version information:

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1

Your Series is indexed by a MultiIndex, and that index takes up a lot of space. ser.reset_index(drop=True).memory_usage(deep=True) returns 7077968.
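The point above can be verified directly: a minimal sketch (exact byte counts may vary slightly across pandas versions) that separates the Series' values from its MultiIndex using memory_usage(index=False):

```python
import numpy as np
import pandas as pd
from itertools import product

# Rebuild the same shapes as in the question.
index = list(map(''.join, product('ABCDEFGH', repeat=4)))    # 8**4 = 4096 rows
columns = list(map(''.join, product('xyzuvw', repeat=3)))    # 6**3 = 216 columns

df = pd.DataFrame(np.random.randn(len(index), len(columns)),
                  index=index, columns=columns)
ser = df.stack()

values_only = len(index) * len(columns) * 8     # 884736 float64s = 7077888 bytes
with_index = ser.memory_usage()                 # values + MultiIndex (default)
without_index = ser.memory_usage(index=False)   # values only

print(values_only, without_index, with_index - without_index)
```

The difference between the two memory_usage calls is exactly the MultiIndex: the stacked Series must store 884,736 (row, column) label pairs, whereas the DataFrame only stores its 4,096 row labels and 216 column labels once.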