Split timestamp column into two new columns in CSV using python and pandas

Question

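I have a large CSV file with over 210,000 rows. I am new to Python and pandas. I want to efficiently iterate over the timestamp column, split it into two new columns (date and time), format the new date column as %Y%m%d, and drop the new time column. In other words, only the newly formatted date column should be written back to the CSV file. How can I do this?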

Sample input file:

   minit,timestamp,open,high,low,close
   0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706
   1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706
   2,2009-02-23 17:34:00,1.2706,1.2707,1.2702,1.2702
   3,2009-02-23 17:35:00,1.2704,1.2706,1.27,1.27
   4,2009-02-23 17:36:00,1.2701,1.2706,1.2698,1.2703
   5,2009-02-23 17:37:00,1.2703,1.2703,1.27,1.2702
   6,2009-02-23 17:38:00,1.2701,1.2701,1.2696,1.2697

Sample output file:

   minit,date,open,high,low,close
   0,20090223,1.2708,1.2708,1.2706,1.2706
   1,20090223,1.2708,1.2708,1.2705,1.2706
   2,20090223,1.2706,1.2707,1.2702,1.2702
   3,20090223,1.2704,1.2706,1.27,1.27
   4,20090223,1.2701,1.2706,1.2698,1.2703
   5,20090223,1.2703,1.2703,1.27,1.2702
   6,20090223,1.2701,1.2701,1.2696,1.2697

After searching on Google, I started writing some sample code to accomplish this:

    import csv
    import itertools
    import operator
    import time
    import datetime
    import pandas as pd
    from pandas import DataFrame, Timestamp
    from numpy import *

    # Parameters are named s rather than str to avoid shadowing the builtin.
    def datestring_to_timestamp(s):
        return time.mktime(time.strptime(s, "%Y-%m-%d %H:%M:%S"))

    def timestamp_to_datestring(timestamp):
        return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))

    # Note: strftime("%s") is a platform-specific extension and is not portable.
    def timestamp_to_float(s):
        return float(datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S').strftime("%s"))

    def timestamp_to_intstring(s):
        return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S').strftime("%s")

    def timestamp_to_int(s):
        return int(datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S').strftime("%s"))

    # In Python 3 the csv module expects text-mode files opened with newline=''.
    with open("inputfile.csv", newline='') as infile, \
         open("outputfile.csv", 'w', newline='') as outfile:
        reader = csv.reader(infile, delimiter=',')
        writer = csv.writer(outfile, delimiter=',')

        # Need to process loop or process the timestamp column

Answer 1

You can specify a date format string in the arguments to to_csv. This will output the dates the way you want, with no need to extract, convert, or add new columns.

So load the data using read_csv:

df = pd.read_csv('mydata.csv', parse_dates=['timestamp'])

In [15]:

df
Out[15]:
   minit           timestamp    open    high     low   close
0      0 2009-02-23 17:32:00  1.2708  1.2708  1.2706  1.2706
1      1 2009-02-23 17:33:00  1.2708  1.2708  1.2705  1.2706
2      2 2009-02-23 17:34:00  1.2706  1.2707  1.2702  1.2702
3      3 2009-02-23 17:35:00  1.2704  1.2706  1.2700  1.2700
4      4 2009-02-23 17:36:00  1.2701  1.2706  1.2698  1.2703
5      5 2009-02-23 17:37:00  1.2703  1.2703  1.2700  1.2702
6      6 2009-02-23 17:38:00  1.2701  1.2701  1.2696  1.2697
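
As a quick check that parse_dates did what is expected, the sketch below reads two of the sample rows from the question and inspects the dtype of the timestamp column. It is only an illustration: io.StringIO stands in for the real file, and the variable names sample and is_datetime are made up for this example.

```python
import io

import pandas as pd

# Two rows copied from the question; io.StringIO stands in for the real CSV file.
sample = io.StringIO(
    "minit,timestamp,open,high,low,close\n"
    "0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706\n"
    "1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706\n"
)

df = pd.read_csv(sample, parse_dates=["timestamp"])

# With parse_dates the column should hold datetime values, not plain strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(df["timestamp"])
print(df.dtypes)
print(is_datetime)
```

If the column came back as object instead, the date_format argument of to_csv would have nothing to format, so this is worth confirming before writing the file.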

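You can rename the column at this stage, and then pass the param date_format='%Y%m%d' to to_csv, which will output only the date part to the csv. We can reload it and display what was saved: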

In [19]:

df.rename(columns={'timestamp':'date'},inplace=True)
df.to_csv(r'c:\data\date.csv', date_format='%Y%m%d')
df1 = pd.read_csv(r'C:\data\date.csv', index_col=[0])
df1
Out[19]:
   minit      date    open    high     low   close
0      0  20090223  1.2708  1.2708  1.2706  1.2706
1      1  20090223  1.2708  1.2708  1.2705  1.2706
2      2  20090223  1.2706  1.2707  1.2702  1.2702
3      3  20090223  1.2704  1.2706  1.2700  1.2700
4      4  20090223  1.2701  1.2706  1.2698  1.2703
5      5  20090223  1.2703  1.2703  1.2700  1.2702
6      6  20090223  1.2701  1.2701  1.2696  1.2697
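
Note that the call above also writes the DataFrame index as an unnamed first column, which is why the reload uses index_col=[0]. To get a file that matches the sample output in the question exactly, passing index=False to to_csv should be enough. The sketch below shows that, plus a more explicit variant that converts the column to strings with .dt.strftime before writing. It writes to in-memory buffers instead of real files, and names such as sample, csv_text and csv_text_explicit are illustrative only.

```python
import io

import pandas as pd

# Three rows copied from the question; io.StringIO stands in for the real CSV file.
sample = io.StringIO(
    "minit,timestamp,open,high,low,close\n"
    "0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706\n"
    "1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706\n"
    "3,2009-02-23 17:35:00,1.2704,1.2706,1.27,1.27\n"
)

df = pd.read_csv(sample, parse_dates=["timestamp"])
df = df.rename(columns={"timestamp": "date"})

# Route 1: let to_csv format the datetime column while writing.
# index=False keeps the row index out of the output.
out = io.StringIO()
df.to_csv(out, index=False, date_format="%Y%m%d")
csv_text = out.getvalue()
print(csv_text)

# Route 2: convert the column to strings explicitly, then write as usual.
df_explicit = df.copy()
df_explicit["date"] = df_explicit["date"].dt.strftime("%Y%m%d")
csv_text_explicit = df_explicit.to_csv(index=False)
```

Route 1 leaves the column as datetimes in memory, which is convenient if more date arithmetic follows. Route 2 changes the data itself, which may be preferable when the formatted strings are needed elsewhere before saving.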
