How to Load Data into Amazon Redshift via Python Boto3?

Question

In the Amazon Redshift Getting Started Guide, data is pulled from Amazon S3 and loaded into an Amazon Redshift cluster using SQLWorkbench/J. I'd like to mimic the same process of connecting to the cluster and loading sample data into it using Boto3.

However, in Boto3's documentation for Redshift, I can't find a method that would let me upload data into an Amazon Redshift cluster.

I've been able to connect to Redshift with Boto3 using the following code:

client = boto3.client('redshift')

But I'm not sure which method would let me create tables or upload data to Amazon Redshift the way it's done in the tutorial with SQLWorkbench/J.

Answer 1

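Go back to step 4 in the tutorial you linked. See where it shows you how to get the URL of the cluster? You have to connect to that URL with a PostgreSQL driver. AWS SDKs such as Boto3 provide access to the AWS API, which manages clusters but does not run SQL against them. You need to connect to Redshift over the PostgreSQL API, just as you would connect to a PostgreSQL database on RDS.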
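
To make that split concrete, here is a minimal sketch: Boto3 is used only to look up the cluster endpoint through the AWS API, while the SQL connection itself goes through psycopg2. The helper names are made up for illustration, and the cluster identifier, database name, user, password, and region passed to connect_to_cluster are placeholders you would supply yourself.

```python
def endpoint_from_response(response):
    """Extract (host, port) from a Redshift describe_clusters response."""
    endpoint = response["Clusters"][0]["Endpoint"]
    return endpoint["Address"], endpoint["Port"]


def connect_to_cluster(cluster_id, dbname, user, password, region):
    """Look up the endpoint via the AWS API, then connect via PostgreSQL."""
    # Imported lazily so the helper above works without these packages.
    import boto3
    import psycopg2

    # AWS API call: cluster metadata only, no SQL.
    client = boto3.client("redshift", region_name=region)
    response = client.describe_clusters(ClusterIdentifier=cluster_id)
    host, port = endpoint_from_response(response)

    # PostgreSQL-protocol connection: this is what runs CREATE TABLE, COPY, etc.
    return psycopg2.connect(host=host, port=port, dbname=dbname,
                            user=user, password=password)
```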

Answer 2

Right, you need the psycopg2 Python module to execute the COPY command.

My code looks like this:

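import psycopg2

# Inputs (example values; replace with your own)
to_table = 'my_table'                # target table, assumed to exist already
fn = 's3://path_to__input_file.gz'   # S3 path of the input file
delim = ','                          # field delimiter
quote = ''                           # optional, e.g. "QUOTE '\"'"
gzip = 'gzip'                        # leave empty for uncompressed input
AWS_ACCESS_KEY_ID = '***'
AWS_SECRET_ACCESS_KEY = '***'

# Amazon Redshift connection string
conn_string = "dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"

# Connect to Redshift (the cluster must accept connections from this machine)
con = psycopg2.connect(conn_string)

# Build the COPY statement. The values are interpolated into the SQL text,
# so they must come from a trusted source.
sql = """COPY %s FROM '%s' credentials
      'aws_access_key_id=%s;aws_secret_access_key=%s'
      delimiter '%s' FORMAT CSV %s %s;""" % (
    to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, delim, quote, gzip)

cur = con.cursor()
cur.execute(sql)
con.commit()  # psycopg2 opens a transaction implicitly; commit before closing
con.close()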

I used boto3/psycopg2 to write CSV_Loader_For_Redshift.
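
As a variation on the approach in this answer, COPY can also authorize through an IAM role attached to the cluster (the IAM_ROLE parameter) instead of embedding access keys in the SQL text. The sketch below only builds the statement. build_copy_sql is a hypothetical helper, not part of psycopg2 or Boto3, and the table name, S3 path, and role ARN are placeholders.

```python
import re


def build_copy_sql(table, s3_path, iam_role_arn, delimiter=",", gzip=False):
    """Return a Redshift COPY statement for a CSV file stored in S3."""
    # Table names cannot be bound as query parameters, so validate strictly.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?", table):
        raise ValueError("unexpected table name: %r" % table)
    # Reject quotes so the values cannot break out of their SQL literals.
    for value in (s3_path, iam_role_arn, delimiter):
        if "'" in value:
            raise ValueError("single quotes are not allowed: %r" % value)
    sql = "COPY %s FROM '%s' IAM_ROLE '%s' FORMAT AS CSV DELIMITER '%s'" % (
        table, s3_path, iam_role_arn, delimiter)
    if gzip:
        sql += " GZIP"
    return sql + ";"
```

The result would be passed to cur.execute(...) and followed by con.commit(), exactly as with the key-based statement.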

Answer 3

Use psycopg2 & get_cluster_credentials

Prerequisites -

An IAM role attached to the respective user

An IAM role with the get_cluster_credentials policy
On the cloud (EC2), an instance with the appropriate IAM role attached
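
For reference, the policy in the second prerequisite might look like the following sketch, which builds an IAM policy document allowing the redshift:GetClusterCredentials action. credentials_policy is a hypothetical helper, and the region, account ID, cluster name, database user, and database name are placeholder values; check the resource ARNs against your own setup before relying on it.

```python
import json


def credentials_policy(region, account_id, cluster_id, db_user, db_name):
    """Build a policy document granting GetClusterCredentials for one user/db."""
    prefix = "arn:aws:redshift:%s:%s" % (region, account_id)
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                # The database user the temporary credentials are issued for
                "%s:dbuser:%s/%s" % (prefix, cluster_id, db_user),
                # The database those credentials may log in to
                "%s:dbname:%s/%s" % (prefix, cluster_id, db_name),
            ],
        }],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only
    policy = credentials_policy("us-west-2", "123456789012",
                                "examplecluster", "example_user", "dev")
    print(json.dumps(policy, indent=2))
```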

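The connection code in this answer works only if you deploy it on a PC/VM where a user's AWS credentials are already configured [CLI - aws configure], or on an instance in the same account and VPC.

Have a config.ini file -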

[Redshift]
port = 5439
username = please_enter_username
database_name = please_enter_database_name
cluster_id = please_enter_cluster_id_name
url = please_enter_cluster_endpoint_url
region = us-west-2

My Redshift_Connection.py

import configparser
import logging

import boto3
import psycopg2


def db_connection():
    logger = logging.getLogger(__name__)

    # Read the connection settings from config.ini
    parser = configparser.ConfigParser()
    parser.read('config.ini')
    RS_PORT = parser.get('Redshift', 'port')
    RS_USER = parser.get('Redshift', 'username')
    DATABASE = parser.get('Redshift', 'database_name')
    CLUSTER_ID = parser.get('Redshift', 'cluster_id')
    RS_HOST = parser.get('Redshift', 'url')
    REGION_NAME = parser.get('Redshift', 'region')

    # Request temporary database credentials through the AWS API
    client = boto3.client('redshift', region_name=REGION_NAME)
    cluster_creds = client.get_cluster_credentials(DbUser=RS_USER,
                                                   DbName=DATABASE,
                                                   ClusterIdentifier=CLUSTER_ID,
                                                   AutoCreate=False)

    # Open the PostgreSQL-protocol connection with those credentials
    try:
        conn = psycopg2.connect(
            host=RS_HOST,
            port=RS_PORT,
            user=cluster_creds['DbUser'],
            password=cluster_creds['DbPassword'],
            dbname=DATABASE
        )
        return conn
    except psycopg2.Error:
        logger.exception('Failed to open database connection.')
        raise  # let the caller see the failure instead of receiving None

Query execution script -

from Redshift_Connection import db_connection


def executescript(redshift_cursor):
    # Replace the placeholders with a real schema and table name
    query = "SELECT * FROM <SCHEMA_NAME>.<TABLENAME>"
    cur = redshift_cursor
    cur.execute(query)


conn = db_connection()
conn.set_session(autocommit=False)
cursor = conn.cursor()
executescript(cursor)
conn.close()
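
The query script in this answer executes the SELECT but never reads the result. A minimal sketch of fetching the rows follows. fetch_rows is a hypothetical helper that relies only on the standard DB-API cursor methods (execute, fetchall, close), which psycopg2 cursors provide.

```python
def fetch_rows(conn, query, params=None):
    """Run a query on a DB-API connection and return all rows as a list."""
    cur = conn.cursor()
    try:
        if params is None:
            cur.execute(query)
        else:
            # Let the driver bind the values rather than formatting the SQL
            cur.execute(query, params)
        return cur.fetchall()
    finally:
        # Always release the cursor, even if the query fails
        cur.close()
```

With the connection returned by db_connection(), this could be called as rows = fetch_rows(conn, "SELECT * FROM <SCHEMA_NAME>.<TABLENAME>") before conn.close().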
