AWS Athena table from Python output with dates - dates get wrongly converted

I have a pandas DataFrame with a date column (e.g. "2022-02-02"). I write this table to Parquet using pyarrow:

df[col] = df[col].astype(str)
df.to_parquet(loc)

I then register it as a table in Athena:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
  dt_utc date,
  something string,
  else int
)
STORED AS PARQUET
LOCATION 's3://bucket/loc/'
TBLPROPERTIES (
    'skip.header.line.count'='1'
)

However, I cannot get Athena to accept the date column.

I think that to make this work, you need to save the date column dt_utc as date32 in the Parquet file, rather than as a string:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "dt_utc": ["2021-01-02", "2021-01-03"],
        "something": ["abc", "efg"],
        "else": [1, 2],
    }
)
# Parse the strings into datetimes instead of casting the column to str
df['dt_utc'] = pd.to_datetime(df['dt_utc'])

# Explicit schema: dt_utc is written as a date32 column
schema = pa.schema([
    pa.field("dt_utc", pa.date32()),
    pa.field("something", pa.string()),
    pa.field("else", pa.int32()),
])

# loc is the output location from the question
df.to_parquet(loc, schema=schema)
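
For context on why date32 matters here: as far as I know, Arrow's date32 stores each date as a 32-bit count of days since 1970-01-01, which is what Parquet's DATE type holds and what an Athena date column expects to read. A column cast with astype(str) is written as text instead. Below is a minimal standard-library sketch of that encoding. The helper names to_date32 and from_date32 are made up for illustration and are not part of pyarrow.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def to_date32(iso_string: str) -> int:
    """Hypothetical helper: days since the Unix epoch for a 'YYYY-MM-DD' string."""
    return (date.fromisoformat(iso_string) - EPOCH).days


def from_date32(days: int) -> date:
    """Hypothetical helper: inverse of to_date32."""
    return EPOCH + timedelta(days=days)


# The two sample dates from the DataFrame above
encoded = [to_date32(s) for s in ["2021-01-02", "2021-01-03"]]
print(encoded)                  # [18629, 18630]
print(from_date32(encoded[0]))  # 2021-01-02
```

So the file holds small integers tagged as dates, not the characters "2021-01-02", and the table definition has to match that.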

Edit: if you need to convert the columns to dates programmatically:

date_columns = ["dt_utc"]

for date_column in date_columns:
    df[date_column] = pd.to_datetime(df[date_column])

schema = pa.Schema.from_pandas(df)

schema = pa.schema([
    pa.field(field.name, pa.date32()) if field.name in date_columns else field
    for field in schema
])

df.to_parquet("hello.parquet", schema=schema)