Pyarrow timestamp keeps converting to 1970

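I am trying to store a timestamp in a Parquet file alongside all the other data in my dataframe, indicating when the data was saved to disk. Normally I would just store the timestamp in the pandas dataframe itself, but pyarrow doesn't like the way pandas stores timestamps and complains whenever I run pa.Table.from_pandas(), no matter what I do. One workaround is to append the timestamp directly as a column of the table, but for some reason pyarrow keeps converting the timestamp to 1970. I have tried several workarounds, and nothing seems to work.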

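Below is a working code example that reproduces the problem. The example does not actually append anything to the table, but it shows the issue: the timestamp returned by datetime.now().timestamp() is correct, yet once it is converted to a pyarrow array it resets to 1970.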

from datetime import datetime
import pyarrow as pa
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.uniform(size=(20,10)))
df = pd.DataFrame(data)
df.columns = [str(i) for i in range(data.shape[1])]
schema = [(str(i), pa.float32()) for i in range(data.shape[1])]
schema = pa.schema(schema)

ts = datetime.now().timestamp()
print('DateTime timestamp:', ts)
table = pa.Table.from_pandas(df, schema)
pa_ts = pa.array([ts] * len(table), pa.timestamp('us'))
print('PyArrow timestamp:', pa_ts)

This is the output I get:

DateTime timestamp: 1650817852.093818
PyArrow timestamp: [
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852,
  1970-01-01 00:27:30.817852
]
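
The 1970 dates are what you get when a count of seconds is read as a count of microseconds. Here is a minimal sketch of that arithmetic using only the standard library. It does not call pyarrow, and EPOCH and the variable names are made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime (name chosen for this sketch).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# The value printed by datetime.now().timestamp() in the output above.
ts = 1650817852.093818

# Reading the number as microseconds: about 27.5 minutes after the epoch.
as_microseconds = EPOCH + timedelta(microseconds=int(ts))
print(as_microseconds)  # 1970-01-01 00:27:30.817852+00:00

# Reading the same number as seconds gives the intended date (in UTC).
as_seconds = EPOCH + timedelta(seconds=ts)
print(as_seconds.date())  # 2022-04-24
```

The first result matches the rows in the pyarrow output exactly, so the value itself is intact and only the unit is off.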

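As FObersteiner mentioned, the problem is that I was telling pyarrow to treat the value as a microsecond-level timestamp, while datetime.now().timestamp() returns seconds since the epoch. In case anyone runs into this in the future, the fix is simply to change 'us' above to 's'. If you want millisecond-level timestamps, you can do this: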

from datetime import datetime
import pyarrow as pa
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.uniform(size=(20,10)))
df = pd.DataFrame(data)
df.columns = [str(i) for i in range(data.shape[1])]
schema = [(str(i), pa.float32()) for i in range(data.shape[1])]
schema = pa.schema(schema)

ts = int(datetime.now().timestamp() * 1000)  # whole milliseconds since the epoch
print('DateTime timestamp:', ts)
table = pa.Table.from_pandas(df, schema)
pa_ts = pa.array([ts] * len(table), pa.timestamp('ms'))
print('PyArrow timestamp:', pa_ts)