How can I load a data frame saved in pandas as an HDF5 file in R?
I saved a data frame from pandas to an HDF5 file:
import numpy as np
import pandas as pd
np.random.seed(1)
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print('frame: {0}'.format(frame))
store = pd.HDFStore('file.h5')
store['df'] = frame
store.close()
The frame looks like this:
frame:                b         d         e
Utah    1.624345 -0.611756 -0.528172
Ohio   -1.072969  0.865408 -2.301539
Texas   1.744812 -0.761207  0.319039
Oregon -0.249370  1.462108 -2.060141
I am trying to load it in R:
#source("http://bioconductor.org/biocLite.R")
#biocLite("rhdf5")
library(rhdf5)
frame = h5ls("file.h5")
frame
However, once loaded into R, it looks like this:
> frame
  group          name       otype dclass   dim
0     /            df   H5I_GROUP
1   /df         axis0 H5I_DATASET STRING     3
2   /df         axis1 H5I_DATASET STRING     4
3   /df  block0_items H5I_DATASET STRING     3
4   /df block0_values H5I_DATASET  FLOAT 3 x 4
>
I also tried:
frame2 = h5read("file.h5", '/df')
frame2
But it returns a list of several values rather than a data frame:
> frame2
$axis0
[1] "b" "d" "e"
$axis1
[1] "Utah"   "Ohio"   "Texas"  "Oregon"
$block0_items
[1] "b" "d" "e"
$block0_values
           [,1]       [,2]       [,3]       [,4]
[1,]  1.6243454 -1.0729686  1.7448118 -0.2493704
[2,] -0.6117564  0.8654076 -0.7612069  1.4621079
[3,] -0.5281718 -2.3015387  0.3190391 -2.0601407
How can I load a data frame saved in pandas as an HDF5 file in R?
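For what it's worth, the list printed above already seems to contain every piece of the original frame: `axis1` holds the row labels, `block0_items` the column labels, and `block0_values` the numbers, with one row per column label and one column per index label (so it is transposed relative to the pandas frame). Below is a minimal pure-Python sketch of how those pieces fit back together. The values are copied from the R output above; the helper name `assemble_frame` is made up for illustration and is not part of pandas or rhdf5.

```python
# Components copied from the h5read("file.h5", "/df") output shown above.
axis0 = ["b", "d", "e"]                      # column labels
axis1 = ["Utah", "Ohio", "Texas", "Oregon"]  # row (index) labels
block0_items = ["b", "d", "e"]               # columns held by block0
# As printed by R: one row per column label, one column per index label.
block0_values = [
    [1.6243454, -1.0729686, 1.7448118, -0.2493704],
    [-0.6117564, 0.8654076, -0.7612069, 1.4621079],
    [-0.5281718, -2.3015387, 0.3190391, -2.0601407],
]


def assemble_frame(index, items, values):
    """Hypothetical helper: return {row_label: {column_label: value}}.

    `values` is expected in the orientation R printed it, i.e. one inner
    list per column label, each holding one value per row label.
    """
    if len(values) != len(items):
        raise ValueError("expected one row of values per column label")
    frame = {}
    for row_pos, row_label in enumerate(index):
        # Reading position row_pos from every column transposes the block.
        frame[row_label] = {
            col: values[col_pos][row_pos] for col_pos, col in enumerate(items)
        }
    return frame


frame = assemble_frame(axis1, block0_items, block0_values)
for row_label in axis1:
    print(row_label, [frame[row_label][col] for col in axis0])
```

The same idea in R would amount to transposing `block0_values` and attaching `block0_items` and `axis1` as column and row names.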
Update: this is the approach recommended in the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#external-compatibility
From https://github.com/pandas-dev/pandas/issues/9636 (thanks to John Galt for pointing me to this resource):
HDF5 export example for R
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame({"first": np.random.rand(100),
                   "second": np.random.rand(100),
                   "class": np.random.randint(0, 2, (100,))},
                  index=range(100))
print(df.head())
store = pd.HDFStore("transfer.hdf5", "w", complib=str("zlib"), complevel=5)
store.put("dataframe", df, data_columns=df.columns)
store.close()
Output:
   class     first    second
0      0  0.417022  0.326645
1      0  0.720324  0.527058
2      1  0.000114  0.885942
3      1  0.302333  0.357270
4      1  0.146756  0.908535
In R:
# Load values and column names for all datasets from corresponding nodes and
# insert them into one data.frame object.
library(rhdf5)
loadhdf5data <- function(h5File) {
  listing <- h5ls(h5File)
  # Find all data nodes, values are stored in *_values and corresponding column
  # titles in *_items
  data_nodes <- grep("_values", listing$name)
  name_nodes <- grep("_items", listing$name)
  data_paths <- paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths <- paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
  columns <- list()
  for (idx in seq_along(data_paths)) {
    # Each block is read as (columns x rows), so transpose it first
    data <- data.frame(t(h5read(h5File, data_paths[idx])))
    names <- t(h5read(h5File, name_paths[idx]))
    entry <- data.frame(data)
    colnames(entry) <- names
    columns <- append(columns, entry)
  }
  data <- data.frame(columns)
  return(data)
}
Now you can import the DataFrame:
> data = loadhdf5data("transfer.hdf5")
> head(data)
         first    second class
1 0.4170220047 0.3266449     0
2 0.7203244934 0.5270581     0
3 0.0001143748 0.8859421     1
4 0.3023325726 0.3572698     1
5 0.1467558908 0.9085352     1
6 0.0923385948 0.6233601     1
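To make the pairing logic of `loadhdf5data` easier to follow, here is a rough Python sketch of the same idea run against an in-memory stand-in for the file instead of a real HDF5 file. Several things here are assumptions: the `nodes` dictionary and its paths are hypothetical, the split into a float block (`block0`) and an integer block (`block1`) is only inferred from the column order in the output above, and only the first three rows are copied over. Unlike the R function, which pairs `*_values` and `*_items` nodes by position, this sketch pairs them by their shared name prefix.

```python
# Hypothetical in-memory stand-in for the nodes h5ls/h5read would expose.
# Each *_values block uses the orientation R reads: one inner list per column.
nodes = {
    "/dataframe/block0_items": ["first", "second"],
    "/dataframe/block0_values": [
        [0.4170220047, 0.7203244934, 0.0001143748],
        [0.3266449, 0.5270581, 0.8859421],
    ],
    "/dataframe/block1_items": ["class"],
    "/dataframe/block1_values": [[0, 0, 1]],
}

VALUES_SUFFIX = "_values"
ITEMS_SUFFIX = "_items"


def load_blocks(nodes):
    """Merge every *_values block with its *_items labels into {column: [values]}."""
    columns = {}
    for value_path in sorted(nodes):
        if not value_path.endswith(VALUES_SUFFIX):
            continue
        # Pair the block with the label node that shares its prefix.
        item_path = value_path[: -len(VALUES_SUFFIX)] + ITEMS_SUFFIX
        labels = nodes[item_path]
        block = nodes[value_path]
        if len(labels) != len(block):
            raise ValueError("block %s does not match its labels" % value_path)
        for label, series in zip(labels, block):
            columns[label] = list(series)
    return columns


columns = load_blocks(nodes)
for name, series in columns.items():
    print(name, series)
```

Pairing by prefix is a small design choice that avoids relying on both `grep` results coming back in the same order; for the file layout shown in this thread either approach should give the same result.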