Pandas convert_to_r_dataframe function KeyError
I created a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(x.toarray(), columns = colnames)
Then I convert it to an R data frame:
import pandas.rpy.common as com
rdf = com.convert_to_r_dataframe(df)
On Windows, with this configuration, it works without problems:
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: AMD64 Family 16 Model 4
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...
But when I run it on Linux with this configuration:
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...
I get:
Traceback (most recent call last):
  File "app.py", line 232, in <module>
    clf.global_cl(df, df2)
  File "/home/uzer/app/util/clftool.py", line 202, in global_cl
    rdf = com.convert_to_r_dataframe(df)
  File "/home/uzer/app/venv/local/lib/python2.7/site-packages/pandas/rpy/common.py", line 324, in convert_to_r_dataframe
    value = VECTOR_TYPES[value_type](value)
KeyError: <type 'numpy.int64'>
It looks as if VECTOR_TYPES does not have <type 'numpy.int64'> as a key. But that is not true:
VECTOR_TYPES = {np.float64: robj.FloatVector,
                np.float32: robj.FloatVector,
                np.float: robj.FloatVector,
                np.int: robj.IntVector,
                np.int32: robj.IntVector,
                np.int64: robj.IntVector,
                np.object_: robj.StrVector,
                np.str: robj.StrVector,
                np.bool: robj.BoolVector}
So I printed the variable type inside convert_to_r_dataframe (in ../pandas/rpy/common.py):
for column in df:
    value = df[column]
    value_type = value.dtype.type
    # debug output: the scalar type pandas reports for this column
    print("value_type: %s" % value_type)
    if value_type == np.datetime64:
        value = convert_to_r_posixct(value)
    else:
        value = [item if pd.notnull(item) else NA_TYPES[value_type]
                 for item in value]
        # debug output: identity check against the np.int64 dictionary key
        print("Is value_type == np.int64: %s" % (value_type is np.int64))
        value = VECTOR_TYPES[value_type](value)
    ...
This is the result:
value_type: <type 'numpy.int64'>
Is value_type == np.int64: False
How is this possible? Since the 32-bit Windows build works fine, could the 64-bit Linux Python build be the problem?
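One way a lookup can fail even though the printed type matches a key is that two distinct type objects share the same name. The sketch below reproduces that situation with plain Python classes; `make_int64`, `KeyType`, `ValueType`, and `VECTOR_TABLE` are hypothetical names invented for the illustration. Whether numpy actually exposes two such 64-bit integer types on this platform (for example a separate `np.longlong`) is an assumption here, not something the traceback proves.

```python
# Hypothetical stand-ins: two separate classes that share the name "int64".
def make_int64():
    # Each call builds a brand-new class object with the same name.
    return type("int64", (int,), {})

KeyType = make_int64()    # plays the role of the dictionary key (np.int64)
ValueType = make_int64()  # plays the role of value.dtype.type

# Toy analogue of VECTOR_TYPES: keyed by the class object itself.
VECTOR_TABLE = {KeyType: "IntVector"}

same_name = KeyType.__name__ == ValueType.__name__  # names are identical
same_object = KeyType is ValueType                  # but the objects differ
found = ValueType in VECTOR_TABLE                   # so the lookup misses

print(same_name, same_object, found)
```

A dictionary keyed by classes compares the class objects themselves, so a matching name or repr is not enough for the lookup to succeed.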
Edit: as @lgautier suggested, I changed this:
rdf = com.convert_to_r_dataframe(df)
to:
from rpy2.robjects import pandas2ri
rdf = pandas2ri.pandas2ri(df)
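That worked.
Note: my data frame has column names that contain UTF-8 special characters, decoded to unicode. When the DataFrame constructor (in rpy2/robjects/vectors.py) is called, this line tries to encode the unicode strings (with their special characters) as ASCII strings: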
kv = [(str(k), conversion.py2ri(obj[k])) for k in obj]
This raises an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
To fix this, I had to change that line to:
kv = [(k.encode('UTF-8'), conversion.py2ri(obj[k])) for k in obj]
rpy2 should provide a way to change the encoding.
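The encoding failure can be reproduced without rpy2. This is a minimal sketch, assuming a hypothetical column name `column_name` with 'à' at index 4; it encodes explicitly rather than through `str()`, because the implicit ASCII encoding done by `str()` on a unicode string is Python 2 behaviour.

```python
# Hypothetical column name with 'à' (U+00E0) at index 4.
column_name = u"tell\xe0"

# Encoding to ASCII fails, which is what str(k) does implicitly on Python 2.
try:
    column_name.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding to UTF-8 succeeds and yields a byte string.
utf8_name = column_name.encode("UTF-8")

print(ascii_error)
print(repr(utf8_name))
```

The explicit `encode('UTF-8')` in the patched line avoids the ASCII default, at the cost of hard-coding one encoding.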
Consider using rpy2's own conversion (it appears to work with int64 on Linux):
# create a test DataFrame
import numpy
import pandas
i2d = numpy.array([[1, 2, 3], [4, 5, 6]], dtype="int64")
colnames = ('a', 'b', 'c')
dataf = pandas.DataFrame(i2d,
                         columns=colnames)
# rpy2's conversion of pandas objects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
From now on, pandas DataFrame objects are converted automatically to rpy2/R DataFrame objects on every call that uses the embedded R. For example:
from rpy2.robjects.packages import importr
# R's "base" package
base = importr('base')
# call the R function "summary"
print(base.summary(dataf))
The conversion can also be called explicitly:
from rpy2.robjects import conversion
rpy2_dataf = conversion.py2ro(dataf)
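Edit: (adding a customization to work around the str(k) issue)
If anything related to conversion needs local customization, this can be done relatively easily. One way to change how the R DataFrame is built is: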
from pandas import DataFrame as PandasDataFrame
from rpy2.robjects.vectors import DataFrame as RDataFrame
from rpy2.robjects import conversion
from rpy2 import rinterface

@conversion.py2ro.register(PandasDataFrame)
def py2ro_pandasdataframe(obj):
    ri_dataf = conversion.py2ri(obj)
    # cast down to an R list (goes through a different code path
    # in the DataFrame constructor, avoiding `str(k)`)
    ri_list = rinterface.SexpVector(ri_dataf)
    return RDataFrame(ri_list)
From now on, the conversion function above is used whenever a pandas DataFrame is present:
rpy2_dataf = conversion.py2ro(dataf)
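The `register` call above looks like type-based single dispatch. The following sketch shows that mechanism with the standard library's `functools.singledispatch` (Python 3.4+), on the assumption that rpy2's `register` behaves similarly. `ToyFrame`, `to_r`, and `_toyframe_to_r` are hypothetical names, and no rpy2 or pandas objects are involved.

```python
from functools import singledispatch


class ToyFrame(dict):
    """Hypothetical stand-in for a data frame: column name -> values."""


@singledispatch
def to_r(obj):
    # Default conversion, used when no handler is registered for the type.
    return str(obj)


@to_r.register(ToyFrame)
def _toyframe_to_r(obj):
    # Custom handler: encode column names as UTF-8 instead of calling str().
    return [(name.encode("UTF-8"), list(values)) for name, values in obj.items()]


converted = to_r(ToyFrame({u"tell\xe0": (1, 2)}))  # dispatches to the handler
fallback = to_r(3)                                 # falls back to the default

print(converted)
print(fallback)
```

Registering a handler for one type leaves the default path untouched for everything else, which is why a local override of this kind does not require patching the library source.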