How to keep major-order when copying or groupby-ing a pandas DataFrame?
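How can I use or manipulate (monkey-patch) pandas so that copy and groupby aggregations always keep the same major order on the resulting object?
I use pandas.DataFrame as the data structure in a business application (a risk model) and need fast aggregation of multidimensional data. Aggregation with pandas depends crucially on the major-ordering scheme of the underlying numpy array.
Unfortunately, pandas (version 0.23.4) changes the major order of the underlying numpy array when I create a copy or when I aggregate with groupby and sum.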
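The impact is:
Case 1: 17.2 seconds
Case 2: 5 min 46 s
on a DataFrame and its copy with 45023 rows and 100000 columns. The aggregation was performed on the index, a pd.MultiIndex with 15 levels. It keeps three levels and yields about 239 groups.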
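I typically work with DataFrames of 45000 rows and 100000 columns. On the rows I have a pandas.MultiIndex with about 15 levels. To compute statistics on various hierarchy nodes I need to aggregate (sum) along the index dimension.
Aggregation is fast if the underlying numpy array is c_contiguous, i.e. held in row-major order (C order). It is very slow if the array is f_contiguous, i.e. held in column-major order (F order).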
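Unfortunately, pandas changes the major order from C to F when creating a copy of a DataFrame, and even when performing an aggregation via groupby and taking the sum on the grouper. The resulting DataFrame therefore has a different major order (!)
Sure, I could switch to another 'datamodel' by keeping the MultiIndex on the columns. Then the current pandas version would always work in my favor. But this is a no-go. I think one can expect that the two operations under consideration (groupby-sum and copy) do not change the major order.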
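To make the C versus F distinction concrete, here is a minimal pure-Python sketch of how a dense 2-D array maps an index (i, j) to a flat memory offset under each order. The helper names element_strides and flat_offset are illustrative only and are not part of numpy or pandas. In C order (row-major) the elements of one row sit next to each other, which is plausibly why adding up whole rows, as a groupby-sum over the index does, is cheaper on a C-contiguous array.

```python
def element_strides(shape, order):
    """Return strides (counted in elements, not bytes) of a dense 2-D array."""
    rows, cols = shape
    if order == "C":   # row-major: the elements of one row are adjacent
        return (cols, 1)
    if order == "F":   # column-major: the elements of one column are adjacent
        return (1, rows)
    raise ValueError("order must be 'C' or 'F'")


def flat_offset(index, strides):
    """Flat memory offset of element (i, j) for the given strides."""
    i, j = index
    return i * strides[0] + j * strides[1]


shape = (3, 4)
c_strides = element_strides(shape, "C")
f_strides = element_strides(shape, "F")

# Offsets of the four elements of row 0 under each layout.
row0_c = [flat_offset((0, j), c_strides) for j in range(shape[1])]
row0_f = [flat_offset((0, j), f_strides) for j in range(shape[1])]

print(row0_c)  # [0, 1, 2, 3]  -> contiguous in C order
print(row0_f)  # [0, 3, 6, 9]  -> strided in F order
```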
import numpy as np
import pandas as pd
print("pandas version: ", pd.__version__)
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array.flags
print("Numpy array is C-contiguous: ", array.flags.c_contiguous)
dataframe = pd.DataFrame(array, index = pd.MultiIndex.from_tuples([('A', 'U'), ('A', 'V'), ('B', 'W')], names=['dim_one', 'dim_two']))
print("DataFrame is C-contiguous: ", dataframe.values.flags.c_contiguous)
dataframe_copy = dataframe.copy()
print("Copy of DataFrame is C-contiguous: ", dataframe_copy.values.flags.c_contiguous)
aggregated_dataframe = dataframe.groupby('dim_one').sum()
print("Aggregated DataFrame is C-contiguous: ", aggregated_dataframe.values.flags.c_contiguous)
## Output in Jupyter Notebook
# pandas version: 0.23.4
# Numpy array is C-contiguous: True
# DataFrame is C-contiguous: True
# Copy of DataFrame is C-contiguous: False
# Aggregated DataFrame is C-contiguous: False
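The major order of the data should be preserved. If pandas prefers to switch to an implicit preference, it should allow this to be overridden. NumPy lets you specify the order when creating a copy.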
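For reference, this is roughly what controlling the order looks like on the NumPy side. It is a small sketch on a toy array, using only ndarray.copy with its order argument and np.ascontiguousarray. It says nothing about what pandas does internally.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # NumPy creates C-contiguous arrays by default
f = a.copy(order="F")            # force a column-major (Fortran order) copy
k = f.copy(order="K")            # 'K' keeps the layout of the source array
c = np.ascontiguousarray(f)      # convert back to row-major (C order)

print(a.flags.c_contiguous, f.flags.f_contiguous)
print(k.flags.f_contiguous, c.flags.c_contiguous)
print(np.array_equal(a, f))      # the logical contents are unchanged
```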
A patched version of pandas should produce
## Output in Jupyter Notebook
# pandas version: 0.23.4
# Numpy array is C-contiguous: True
# DataFrame is C-contiguous: True
# Copy of DataFrame is C-contiguous: True
# Aggregated DataFrame is C-contiguous: True
for the example code snippet above.
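Monkey patch for pandas (0.23.4, and perhaps other versions as well)
I created a patch that I would like to share with you. It yields the performance gain mentioned in the question above.
It works for pandas version 0.23.4. For other versions you need to check whether it still works.
The following two modules are needed; adapt the imports depending on where you place them.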
memory_layout.py
memory.py
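To patch your code, import the following at the very start of your program or notebook and set the memory layout parameter. It monkey-patches pandas and ensures that copies of DataFrames have the requested layout.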
from memory_layout import memory_layout
# memory_layout.order = 'F' # assert F-order on copy
# memory_layout.order = 'K' # Keep given layout on copy
memory_layout.order = 'C' # assert C-order on copy
memory_layout.py
Create a file memory_layout.py with the following content.
import numpy as np
from pandas.core.internals import Block
from memory import memory_layout

# memory_layout.order = 'F'  # set memory layout order to 'F' for np.ndarrays in DataFrame copies (Fortran/column-major order)
# memory_layout.order = 'K'  # keep memory layout order for np.ndarrays in DataFrame copies (order out is order in)
memory_layout.order = 'C'  # set memory layout order to 'C' for np.ndarrays in DataFrame copies (C/row-major order)


def copy(self, deep=True, mgr=None):
    """
    Copy patch on Blocks to set or keep the memory layout
    on copies.
    :param self: `pandas.core.internals.Block`
    :param deep: `bool`
    :param mgr: `BlockManager`
    :return: copy of `pandas.core.internals.Block`
    """
    values = self.values
    if deep:
        if isinstance(values, np.ndarray):
            # Copy with the requested layout instead of numpy's default.
            values = memory_layout.copy_transposed(values)
        else:
            values = values.copy()
    return self.make_block_same_class(values)


Block.copy = copy  # Block for pandas 0.23.4: in pandas.core.internals.Block
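The assignment above replaces Block.copy for the rest of the process. If you would rather limit the patch to a section of code, one option is a small context manager that swaps the method in and restores it afterwards. The sketch below shows the idea on a hypothetical stand-in class (FakeBlock is not pandas' Block, and patched and reversed_copy are names made up for this example). Whether scoping the patch this way is safe against a real pandas version is something you would have to test.

```python
from contextlib import contextmanager


class FakeBlock:
    """Hypothetical stand-in for a class whose method we want to patch."""

    def __init__(self, values):
        self.values = values

    def copy(self):
        return FakeBlock(list(self.values))


@contextmanager
def patched(cls, name, replacement):
    """Temporarily replace cls.<name> and restore the original on exit."""
    original = getattr(cls, name)
    setattr(cls, name, replacement)
    try:
        yield original
    finally:
        setattr(cls, name, original)


def reversed_copy(self):
    # Illustrative replacement: copy the values in reverse order.
    return FakeBlock(list(reversed(self.values)))


block = FakeBlock([1, 2, 3])
with patched(FakeBlock, "copy", reversed_copy):
    inside = block.copy().values   # patched behaviour
outside = block.copy().values      # original behaviour restored

print(inside, outside)  # [3, 2, 1] [1, 2, 3]
```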
memory.py
Create a file memory.py with the following content.
"""
Implements MemoryLayout copy factory to change memory layout
of `numpy.ndarrays`.
Depending on the use case, operations on DataFrames can be much
faster if the appropriate memory layout is set and preserved.
The implementation allows for changing the desired layout. Changes apply when
copies or new objects are created, as for example, when slicing or aggregating
via groupby ...
This implementation tries to solve the issue raised on GitHub
https://github.com/pandas-dev/pandas/issues/26502
"""
import numpy as np

_DEFAULT_MEMORY_LAYOUT = 'K'


class MemoryLayout(object):
    """
    Memory layout management for numpy.ndarrays.
    Singleton implementation.
    Example:
    >>> from memory import memory_layout
    >>> memory_layout.order = 'K'
    >>> # K ... keep array layout from input
    >>> # C ... set to c-contiguous / row-major order
    >>> # F ... set to f-contiguous / column-major order
    >>> array = memory_layout.apply(array)
    >>> array = memory_layout.apply(array, 'C')
    >>> array = memory_layout.copy(array)
    >>> array = memory_layout.copy_transposed(array)
    """
    _order = _DEFAULT_MEMORY_LAYOUT
    _instance = None

    @property
    def order(self):
        """
        Return memory layout ordering.
        :return: `str`
        """
        if self.__class__._order is None:
            raise AssertionError("Array layout order not set.")
        return self.__class__._order

    @order.setter
    def order(self, order):
        """
        Set memory layout order.
        Allowed values are 'C', 'F', and 'K'. Raises AssertionError
        when trying to set other values.
        :param order: `str`
        :return: `None`
        """
        assert order in ['C', 'F', 'K'], "Only 'C', 'F' and 'K' supported."
        self.__class__._order = order

    def __new__(cls):
        """
        Create only one instance throughout the lifetime of this process.
        :return: `MemoryLayout` instance as singleton
        """
        if cls._instance is None:
            cls._instance = super(MemoryLayout, cls).__new__(MemoryLayout)
        return cls._instance

    @staticmethod
    def get_from(array):
        """
        Get memory layout from array
        Possible values:
        'C' ... only C-contiguous (row-major order)
        'F' ... only F-contiguous (column-major order)
        'O' ... other: either both C- and F-contiguous (as on empty
                arrays) or neither C- nor F-contiguous.
        :param array: `numpy.ndarray`
        :return: `str`
        """
        if array.flags.c_contiguous == array.flags.f_contiguous:
            return 'O'
        return {True: 'C', False: 'F'}[array.flags.c_contiguous]

    def apply(self, array, order=None):
        """
        Apply the order set or the order given as input on the array
        given as input.
        Possible values:
        'C' ... apply C-contiguous layout (row-major order)
        'F' ... apply F-contiguous layout (column-major order)
        'K' ... keep the given layout
        :param array: `numpy.ndarray`
        :param order: `str`
        :return: `np.ndarray`
        """
        order = self.__class__._order if order is None else order
        if order == 'K':
            return array
        array_order = MemoryLayout.get_from(array)
        if array_order == order:
            return array
        # Ravel and reshape with the same order so that the logical
        # contents stay unchanged and only the memory layout changes.
        return np.reshape(np.ravel(array, order=order), array.shape, order=order)

    def copy(self, array, order=None):
        """
        Return a copy of the input array with the memory layout set.
        Layout set:
        'C' ... return C-contiguous copy
        'F' ... return F-contiguous copy
        'K' ... return copy with same layout as
                given by the input array.
        :param array: `np.ndarray`
        :param order: `str`
        :return: `np.ndarray`
        """
        order = order if order is not None else self.__class__._order
        # ndarray.copy accepts 'C', 'F' and 'K' directly. Passing the result
        # of get_from() would fail for arrays classified as 'O'.
        return array.copy(order=order)

    def copy_transposed(self, array):
        """
        Return a copy of the input array such that its transpose
        has the memory layout set.
        Note: transposing does not reshuffle the data in memory; numpy
        only reinterprets the layout (C becomes F and vice versa).
        Layout set:
        'C' ... return F-contiguous copy
        'F' ... return C-contiguous copy
        'K' ... return copy with the same layout as the input array,
                so that its transpose keeps its layout as well.
        :param array: `np.ndarray`
        :return: `np.ndarray`
        """
        if self.__class__._order == 'K':
            return array.copy(
                order={'C': 'C', 'F': 'F', 'O': None}[self.get_from(array)])
        else:
            return array.copy(
                order={'C': 'F', 'F': 'C'}[self.__class__._order])

    def __str__(self):
        return str(self.__class__._order)


memory_layout = MemoryLayout()  # Singleton
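The reason copy_transposed asks for the opposite order can be checked with plain NumPy. The sketch below assumes, as the patch does, that a block holds its values transposed relative to what DataFrame.values returns; that is an assumption about pandas 0.23.4 internals, not something verified here. It only shows that the transpose of an F-contiguous copy is C-contiguous and shares memory with it.

```python
import numpy as np

block_values = np.arange(6).reshape(2, 3)   # C-contiguous toy "block"
f_copy = block_values.copy(order="F")       # what order 'C' requests in copy_transposed
frame_side = f_copy.T                       # the transposed view of that copy

print(f_copy.flags.f_contiguous)            # the copy itself is column-major
print(frame_side.flags.c_contiguous)        # its transpose is row-major
print(np.shares_memory(f_copy, frame_side)) # transposing did not move any data
```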