Split timestamp column into two new columns in CSV using python and pandas
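I have a large CSV file with over 210,000 rows. I am new to Python and pandas. I want to iterate efficiently over the timestamp column, split it into two new columns (date and time), format the new date column as %Y%m%d, and then drop the new time column, so that only the newly formatted date column is written back to the CSV file. How can I do this?
Sample input file: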
minit,timestamp,open,high,low,close
0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706
1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706
2,2009-02-23 17:34:00,1.2706,1.2707,1.2702,1.2702
3,2009-02-23 17:35:00,1.2704,1.2706,1.27,1.27
4,2009-02-23 17:36:00,1.2701,1.2706,1.2698,1.2703
5,2009-02-23 17:37:00,1.2703,1.2703,1.27,1.2702
6,2009-02-23 17:38:00,1.2701,1.2701,1.2696,1.2697
Sample output file:
minit,date,open,high,low,close
0,20090223,1.2708,1.2708,1.2706,1.2706
1,20090223,1.2708,1.2708,1.2705,1.2706
2,20090223,1.2706,1.2707,1.2702,1.2702
3,20090223,1.2704,1.2706,1.27,1.27
4,20090223,1.2701,1.2706,1.2698,1.2703
5,20090223,1.2703,1.2703,1.27,1.2702
6,20090223,1.2701,1.2701,1.2696,1.2697
After some Googling, I started writing sample code to do this:
import csv
import itertools
import operator
import time
import datetime
import pandas as pd
from pandas import DataFrame, Timestamp
from numpy import *

# Note: strftime("%s") (seconds since the epoch) is a platform-specific
# extension and is not available everywhere.
def datestring_to_timestamp(s):
    return time.mktime(time.strptime(s, "%Y-%m-%d %H:%M:%S"))

def timestamp_to_datestring(timestamp):
    return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))

def timestamp_to_float(s):
    return float(datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S').strftime("%s"))

def timestamp_to_intstring(s):
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S').strftime("%s")

def timestamp_to_int(s):
    return int(datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S').strftime("%s"))

# In Python 3 the csv module expects text-mode files opened with newline=''
with open("inputfile.csv", newline='') as infile, open('outputfile.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter=',')
    writer = csv.writer(outfile, delimiter=',')
    # Need to process loop or process the timestamp column
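You can specify a date format string in the arguments to `to_csv`. This writes the dates in whatever format you like, with no need to extract, convert, or add new columns.
So load the data using `read_csv`:
df = pd.read_csv('mydata.csv', parse_dates=['timestamp'])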
In [15]:
df
Out[15]:
minit timestamp open high low close
0 0 2009-02-23 17:32:00 1.2708 1.2708 1.2706 1.2706
1 1 2009-02-23 17:33:00 1.2708 1.2708 1.2705 1.2706
2 2 2009-02-23 17:34:00 1.2706 1.2707 1.2702 1.2702
3 3 2009-02-23 17:35:00 1.2704 1.2706 1.2700 1.2700
4 4 2009-02-23 17:36:00 1.2701 1.2706 1.2698 1.2703
5 5 2009-02-23 17:37:00 1.2703 1.2703 1.2700 1.2702
6 6 2009-02-23 17:38:00 1.2701 1.2701 1.2696 1.2697
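Once the column has been parsed as datetimes, the split described in the question can also be done explicitly through the `.dt` accessor, which is vectorized and avoids a Python-level loop over the rows. The following is only a minimal sketch: the in-memory `SAMPLE` string and the `split` name are illustrative assumptions, not part of the original code.

```python
import io

import pandas as pd

# Two rows from the question's sample input, kept in memory for illustration.
SAMPLE = """minit,timestamp,open,high,low,close
0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706
1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706
"""

df = pd.read_csv(io.StringIO(SAMPLE), parse_dates=["timestamp"])

# Format the date part and the time part of each timestamp as strings.
split = pd.DataFrame({
    "date": df["timestamp"].dt.strftime("%Y%m%d"),
    "time": df["timestamp"].dt.strftime("%H:%M:%S"),
})
print(split)
```

Since the time column is going to be discarded anyway, this explicit split is not required for the task at hand.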
You can rename the column at this stage. Passing `date_format='%Y%m%d'` to `to_csv` then writes only the date part to the CSV. We can reload the file to show what was saved:
In [19]:
df.rename(columns={'timestamp':'date'},inplace=True)
df.to_csv(r'c:\data\date.csv', date_format='%Y%m%d')
df1 = pd.read_csv(r'C:\data\date.csv', index_col=[0])
df1
Out[19]:
minit date open high low close
0 0 20090223 1.2708 1.2708 1.2706 1.2706
1 1 20090223 1.2708 1.2708 1.2705 1.2706
2 2 20090223 1.2706 1.2707 1.2702 1.2702
3 3 20090223 1.2704 1.2706 1.2700 1.2700
4 4 20090223 1.2701 1.2706 1.2698 1.2703
5 5 20090223 1.2703 1.2703 1.2700 1.2702
6 6 20090223 1.2701 1.2701 1.2696 1.2697
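The call above also writes the DataFrame's row index as an extra first column, which is why the file is reloaded with `index_col=[0]`. To get exactly the layout of the sample output file in the question, `index=False` can be passed to `to_csv`. Here is a minimal end-to-end sketch of that variant. It reads from an in-memory string instead of a file, and `SAMPLE` and `reformat_dates` are hypothetical names used only for illustration.

```python
import io

import pandas as pd

# Two rows from the question's sample input, kept in memory so the
# sketch runs without touching the filesystem.
SAMPLE = """minit,timestamp,open,high,low,close
0,2009-02-23 17:32:00,1.2708,1.2708,1.2706,1.2706
1,2009-02-23 17:33:00,1.2708,1.2708,1.2705,1.2706
"""


def reformat_dates(csv_text):
    """Return csv_text with `timestamp` rewritten as a `date` column in %Y%m%d form."""
    df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])
    df = df.rename(columns={"timestamp": "date"})
    # With no path given, to_csv returns the CSV as a string.
    # index=False keeps the row index out of the output.
    return df.to_csv(index=False, date_format="%Y%m%d")


result = reformat_dates(SAMPLE)
print(result)
```

For a real file, the same two calls apply with file paths in place of the `StringIO` buffer and the returned string.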