Efficiently moving dataframes from Pandas to R with RPy (or other means)

Question

I have a dataframe in Pandas, and I want to run some statistics on it using R functions. No problem! RPy makes it easy to send a dataframe from Pandas into R:

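import pandas as pd
from rpy2 import robjects as ro
from rpy2.robjects import pandas2ri

# Enable automatic pandas -> R conversion (newer rpy2 releases prefer localconverter)
pandas2ri.activate()
df = pd.DataFrame(index=range(100000), columns=range(100))
ro.globalenv['df'] = df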

And if we're in IPython:

%load_ext rmagic
%R -i df

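For some reason the ro.globalenv route is slightly slower than the rmagic route, but no matter. What matters is this: the dataframe I will ultimately be using is about 100GB. That raises a few problems: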

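Even with just 1GB of data, the transfer is rather slow.
If I understand correctly, this creates two copies of the dataframe in memory: one in Python and one in R. That means I will have doubled my memory requirements before I have even run a single statistical test!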

Is there any way to:

transfer a large dataframe between Python and R more quickly?
access the same object in memory? I suspect this is asking for the moon.

Answer 1

rpy2 uses a conversion mechanism that tries to avoid copying objects when moving between Python and R. However, this currently works only in the R -> Python direction.

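Python has an interface called the "buffer interface", which rpy2 uses and which lets it minimize the number of C-level copies between R and Python (see http://rpy.sourceforge.net/rpy2/doc-2.5/html/numpy.html#from-rpy2-to-numpy - that documentation seems outdated, as the __array_struct__ interface is no longer the primary choice).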
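To make the idea of the buffer interface concrete, here is a minimal sketch using only the Python standard library. It does not involve rpy2 or R at all; the array object simply stands in for any object that exposes its memory through the buffer interface, and the names data, view and tail are chosen for illustration.

```python
from array import array

# An array of C doubles; array objects expose the buffer interface.
data = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview wraps the same underlying memory; no elements are copied.
view = memoryview(data)

# Writing through the view changes the original object.
view[0] = 42.0

# Slicing a memoryview yields another view onto the same memory, not a copy.
tail = view[2:]
tail[0] = -1.0

print(data.tolist())
```

Because both views share memory with data, the two writes are visible in the original array. This is the kind of sharing that spares a C-level copy when an R vector is exposed to Python.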

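There is no equivalent of the buffer interface in R. What currently holds me back from providing equivalent functionality in rpy2 is the handling of borrowed references during garbage collection (and the lack of time to think about it carefully enough).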

In summary, there is a way to share data between Python and R without copying, but it requires the data to be created in R.

Answer 2

Currently, feather seems to be the most efficient option for data interchange between R data frames and pandas DataFrames.
