使用 rpy2 来自 pandas DataFrame 的分位数回归模型中的不一致数组

Non-conformable arrays in quantile regression model from pandas DataFrame using rpy2

我正在使用 rpy2 (2.7.6) 对 engel 数据集进行分位数回归:

import statsmodels as sm

from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

pandas2ri.activate()

quantreg = importr('quantreg')

data = sm.datasets.engel.load_pandas().data

qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)

但是这会产生以下错误:

qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
Traceback (most recent call last):

  File "<ipython-input-22-02ee1015737c>", line 1, in <module>
    quantreg.rq('foodexp ~ income', data=data, tau=0.5)

  File "C:\Anaconda\lib\site-packages\rpy2\robjects\functions.py", line 178, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)

  File "C:\Anaconda\lib\site-packages\rpy2\robjects\functions.py", line 106, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)

RRuntimeError: Error in y - x %*% z$coef : non-conformable arrays

据我了解,在这种情况下,不一致的数组意味着存在一些缺失值或使用的 'arrays' 大小不同。我可以确认情况并非如此:

data.count()
Out[26]: 
income     235
foodexp    235
dtype: int64

data.shape
Out[27]: (235, 2)

这个错误还有什么意思? rpy2 中从 DataFrame 到 data.frame 的转换是否可能无法正常工作,或者我在这里遗漏了什么?其他人可以确认这个错误吗?

为了以防万一,这里有一些关于 R 和 Python 版本的信息。

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

Python 2.7.11 |Anaconda 2.3.0 (64-bit)| (default, Dec  7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]
on win32

如有任何帮助,我们将不胜感激。

编辑 1:

如果我直接从 R 加载数据集,我不会收到错误消息:

from rpy2.robjects import r

r.data('engel')
data = r['engel']

qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)

所以我认为 pandas2ri 的转换有问题。当我尝试使用 pandas2ri.py2ri.

手动将 DataFrame 转换为 data.frame 时,会发生同样的错误

编辑 2:

有趣的是,如果我使用已弃用的 pandas.rpy.common.convert_to_r_dataframe,错误就消失了:

import pandas.rpy.common as com

rdata = com.convert_to_r_dataframe(data)

qreg = quantreg.rq('foodexp ~ income', data=rdata, tau=0.5)

pandas2ri 中确实存在错误,这也已得到证实 here

正如在 rpy2 issue tracker 上的回答:

问题的根源似乎是 pandas 数据框中的列被转换为 Array 对象,每个对象只有一列。

>>> pandas2ri.py2ri_pandasdataframe(data) 
<DataFrame - Python:0x7f8af3c2afc8 / R:0x92958b0>
[Array, Array]
  income: <class 'rpy2.robjects.vectors.Array'>
  <Array - Python:0x7f8af57ef908 / R:0x92e1bf0>
[420.157651, 541.411707, 901.157457, ..., 581.359892, 743.077243, 1057.676711]
  foodexp: <class 'rpy2.robjects.vectors.Array'>
  <Array - Python:0x7f8af3c2ab88 / R:0x92e7600>
[255.839425, 310.958667, 485.680014, ..., 468.000798, 522.601906, 750.320163]

区别很微妙,但这似乎混淆了 quantreg 包。还有其他 R 函数似乎独立于对象是具有一列的数组还是向量。

将列转换为 R 向量似乎是解决问题所需要的:

from rpy2.robjects.vectors import FloatVector
mydata=pandas2ri.py2ri_pandasdataframe(data)

from rpy2.robjects.packages import importr
base=importr('base')
mydata[0]=base.as_vector(mydata[0])
mydata[1]=base.as_vector(mydata[1])
# now this is working
qreg = quantreg.rq('foodexp ~ income', data=mydata, tau=0.5)

现在我想收集更多关于这是否可以在不破坏其他东西的情况下解决问题的数据。为此,我将修复程序变成了派生自 pandas 转换器的自定义转换器:

from rpy2.robjects import default_converter
from rpy2.robjects.conversion import Converter, localconverter

from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri, pandas2ri, vectors
import numpy

my_converter = Converter('my converter',
                         template=pandas2ri.converter)
base=importr('base')

def ndarray_forcevector(obj):
    func=numpy2ri.converter.py2ri.registry[numpy.ndarray]
    # current conversion as performed by numpy
    res=func(obj)
    if len(obj.shape) == 1:
        # force into an R vector
        res=base.as_vector(res)
    return res

@my_converter.py2ri.register(pandas2ri.PandasSeries)
def py2ri_pandasseries(obj):
    # this is a copy of the function with the same name in pandas2ri, with
    # the call to ndarray_forcevector() as the only difference
    if obj.dtype == '<M8[ns]':
        # time series
        d = [vectors.IntVector([x.year for x in obj]),
             vectors.IntVector([x.month for x in obj]),
             vectors.IntVector([x.day for x in obj]),
             vectors.IntVector([x.hour for x in obj]),
             vectors.IntVector([x.minute for x in obj]),
             vectors.IntVector([x.second for x in obj])]
        res = vectors.ISOdatetime(*d)
        #FIXME: can the POSIXct be created from the POSIXct constructor ?
        # (is '<M8[ns]' mapping to Python datetime.datetime ?)
        res = vectors.POSIXct(res)
    else:
        # converted as a numpy array
        res = ndarray_forcevector(obj)
    # "index" is equivalent to "names" in R
    if obj.ndim == 1:
        res.do_slot_assign('names',
                           vectors.StrVector(tuple(str(x) for x in obj.index)))
    else:
        res.do_slot_assign('dimnames',
                          vectors.SexpVector(conversion.py2ri(obj.index)))
    return res

使用这个新转换器的最简单方法可能是在上下文管理器中:

with localconverter(default_converter + my_converter) as cv:
    qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)