Get max date from columns in Redshift using a dynamically changing list of columns and tables with Python?

Question

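I have a list of timestamp columns that correspond to specific tables in Redshift. I want to get the most recent date across all timestamp columns of a given table. I can't just write out the column names, because each table has different column names. I have a DataFrame containing the table names and the columns I need: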

table_name      column              data_type
  tbl1       sent_at          timestamp without timezone
  tbl1       message_received timestamp without timezone
  tbl1       scene_updated    timestamp without timezone
  tbl2       phone_updated    timestamp without timezone
  tbl2       col2_updated     timestamp without timezone
  tbl3       sent_at          timestamp without timezone
  tbl3       number_updated   timestamp without timezone

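For each table, I want to check the most recent date across all of its timestamp columns. I'm trying to build the query by creating a string of 'max()' calls and then filling in the parentheses with the column names. Something like this: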

for table in set(df.table_name):
   sub = df[df.table_name == table]
   cols = [x for x in sub.column.values.tolist()]
   str_max = 'max()' * len(cols)
   que = 'select' + str_max + 'from {}'.format(table)
   time_table = pd.read_sql_query(que, conn) 
   ....

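I would then use pandas to take the max across all the columns. However, because the column names change from table to table, I can't figure out how to insert the column names between the '()' to get each max. Maybe there is a way in Redshift to get the max of all column values by filtering on data_type, but I don't know how to do that.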

Answer 1

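I solved this by using another for loop and appending each column name to a string. I then joined the list into a single string so I could put it into one query. After the query runs, I use max() to find the maximum across all the columns.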

for table in set(created_at_tables.table_name):
    # Rows of the metadata frame that describe this table
    sub = created_at_tables[created_at_tables.table_name == table]
    cols = [x for x in sub.column_name.tolist() if x != 'table_updated_at']
    # Build "max(col) as col" for every timestamp column
    col_str = ','.join('max({0}) as {0}'.format(c) for c in cols)
    que = 'select {} from schema.{}'.format(col_str, table)
    new_table = pd.read_sql_query(que, rsm.dbengine)
    # Drop columns whose max is NULL, then take the max across the rest
    new_table = new_table.dropna(axis=1)
    most_recent_date = new_table.max(axis=1).iloc[0]
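The string-building step above can also be pulled into small helpers, which makes it easy to check without a database connection. This is only a sketch: build_max_query, latest_value, the metadata list, and the "schema" placeholder are hypothetical names for illustration, and the row is assumed to arrive as a plain tuple (for example from a DB-API cursor's fetchone()), where NULL shows up as None.

```python
from collections import defaultdict
from datetime import datetime


def build_max_query(table, columns, schema="schema"):
    """Build one SELECT that returns max() of every listed column."""
    # Double-quote identifiers so unusual column names do not break the SQL.
    select_list = ", ".join('max("{0}") as "{0}"'.format(col) for col in columns)
    return 'select {} from "{}"."{}"'.format(select_list, schema, table)


def latest_value(row):
    """Return the newest non-NULL value in a one-row result, or None."""
    present = [value for value in row if value is not None]
    return max(present) if present else None


# Hypothetical (table, column) pairs mirroring the metadata in the question.
metadata = [
    ("tbl1", "sent_at"),
    ("tbl1", "message_received"),
    ("tbl2", "phone_updated"),
]

# Group the column names by table, keeping their original order.
columns_by_table = defaultdict(list)
for table_name, column_name in metadata:
    columns_by_table[table_name].append(column_name)

# One query string per table.
queries = {t: build_max_query(t, cols) for t, cols in columns_by_table.items()}

# A made-up result row: one max per column, with a NULL in the middle.
sample_row = (datetime(2020, 1, 1), None, datetime(2021, 5, 3))
print(queries["tbl1"])
print(latest_value(sample_row))
```

Each string in queries could then be passed to pd.read_sql_query as in the loop above. Keeping the query construction separate from execution is a design choice that lets the generated SQL be inspected before it is sent to Redshift.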


python

pandas

amazon-redshift