Using tidyverse to read data from an S3 bucket
I am trying to read a .csv file stored in an S3 bucket, but I am getting an error. I followed the instructions here, but either they don't work or I made a mistake somewhere, and I can't figure out what I'm doing wrong.
Here is what I'm trying to do:
# I'm working on a SageMaker notebook instance
library(reticulate)
library(tidyverse)
sagemaker <- import('sagemaker')
sagemaker.session <- sagemaker$Session()
region <- sagemaker.session$boto_region_name
bucket <- "my-bucket"
prefix <- "data/staging"
bucket.path <- sprintf("https://s3-%s.amazonaws.com/%s", region, bucket)
role <- sagemaker$get_execution_role()
client <- sagemaker.session$boto_session$client('s3')
key <- sprintf("%s/%s", prefix, 'my_file.csv')
my.obj <- client$get_object(Bucket=bucket, Key=key)
my.df <- read_csv(my.obj$Body) # This is where it all breaks down:
##
## Error: `file` must be a string, raw vector or a connection.
## Traceback:
##
## 1. read_csv(my.obj$Body)
## 2. read_delimited(file, tokenizer, col_names = col_names, col_types = col_types,
## . locale = locale, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment, n_max = n_max, guess_max = guess_max,
## . progress = progress)
## 3. col_spec_standardise(data, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment, guess_max = guess_max, col_names = col_names,
## . col_types = col_types, tokenizer = tokenizer, locale = locale)
## 4. datasource(file, skip = skip, skip_empty_rows = skip_empty_rows,
## . comment = comment)
## 5. stop("`file` must be a string, raw vector or a connection.",
## . call. = FALSE)
With Python, I can read a CSV file with something like this:
import pandas as pd
# ... Lots of boilerplate code
my_data = pd.read_csv(client.get_object(Bucket=bucket, Key=key)['Body'])
This is very similar to what I'm trying to do in R, and it works in Python... so why doesn't it work in R?
Can you point me in the right direction?
Note: although I could use a Python kernel for this, I'd rather stick with R, since I'm more fluent with it than with Python, at least when it comes to working with data frames.
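The immediate cause of your error is that my.obj$Body is a botocore StreamingBody, a Python object that reticulate hands to read_csv() as-is, while readr only accepts a string, a raw vector, or a connection; pandas works because pd.read_csv() also accepts file-like objects. As a minimal sketch, assuming your reticulate version converts Python bytes to an R raw vector (which readr treats as literal data), reading the body first should get you unstuck:
# read() returns Python bytes; reticulate should convert them to an R raw vector
raw.body <- my.obj$Body$read()
my.df <- read_csv(raw.body)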
For everyday use, though, I'd suggest trying the aws.s3 package:
https://github.com/cloudyr/aws.s3
It's quite straightforward. Set your credentials as environment variables:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
"AWS_SECRET_ACCESS_KEY" = "mysecretkey",
"AWS_DEFAULT_REGION" = "us-east-1",
"AWS_SESSION_TOKEN" = "mytoken")
Then, once that's out of the way:
aws.s3::s3read_using(read.csv, object = "s3://bucket/folder/data.csv")
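Note that s3read_using() forwards any extra arguments to the reader you hand it, so you can stay in the tidyverse; for example (the bucket and key below are placeholders):
aws.s3::s3read_using(readr::read_csv, col_types = readr::cols(),
                     object = "s3://bucket/folder/data.csv")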
Update: I see you're already familiar with boto and have been trying reticulate, so I'll also leave this thin wrapper here:
https://github.com/cloudyr/roto.s3
It looks like it has a nice API, including the kind of argument layout you were aiming for:
download_file(
bucket = "is.rud.test",
key = "mtcars.csv",
filename = "/tmp/mtcars-again.csv",
profile_name = "personal"
)
read_csv("/tmp/mtcars-again.csv")