Changing the number of index entries in a DataFrame?
I am trying to change the output of the following code:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Panel, bdate_range, DatetimeIndex, date_range
from pandas.tseries.holiday import get_calendar
from datetime import datetime, timedelta
import pytz as pytz
from pytz import timezone
start = datetime(2013, 1, 1)
hr1 = np.loadtxt("Spot_2013_Hour1.txt")
index = date_range(start, end = '2013-12-31', freq='B')
Allhrs = Series(index)
Allhrs = DataFrame({'hr1': hr1})
df = Allhrs
indexed_df = df.set_index(index)
print indexed_df
Error:
File "<ipython-input-61-c7890d8ccb07>", line 17, in <module>
indexed_df = df.set_index(index)
File "/Applications/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 2390, in set_index
frame.index = index
File "/Applications/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1849, in __setattr__
object.__setattr__(self, name, value)
File "properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:38491)
File "/Applications/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 400, in _set_axis
self._data.set_axis(axis, labels)
File "/Applications/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 1965, in set_axis
'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 365 elements, new values have 261 elements
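Question:
I have a time series loaded from a txt file. The series has 365 elements, one for each day of 2013. I need this txt file because I need to analyze every day.
In addition, I need to analyze specific days in 2013, so I want to change how the data is read: I only want to look at business days. It would also be great to be able to see/print specific days.
Thanks for any help.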
First, create a DataFrame (or Series) that contains all the days of the year:
index = date_range(start='2013-1-1', end='2013-12-31', freq='D')
df = pd.DataFrame(hr1, index=index)
Next, use df.asfreq('B') to downsample df to business days:
import numpy as np
import pandas as pd
# hr1 = np.loadtxt("Spot_2013_Hour1.txt")
hr1 = np.random.random(365)
index = pd.date_range(start='2013-1-1', end='2013-12-31', freq='D')
df = pd.DataFrame(hr1, index=index)
indexed_df = df.asfreq('B')
print(indexed_df)
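To make the cause of the original ValueError concrete, here is a minimal sketch. It assumes a stand-in array in place of the contents of Spot_2013_Hour1.txt, and the names values, daily, business and weekdays_df are only illustrative. The 365 values need a 365-entry daily index first; the business-day index has only 261 entries, which is why set_index complained.

```python
import numpy as np
import pandas as pd

# Stand-in for the 365 values loaded from the text file (assumption).
values = np.arange(365, dtype=float)

daily = pd.date_range(start="2013-01-01", end="2013-12-31", freq="D")
business = pd.date_range(start="2013-01-01", end="2013-12-31", freq="B")

# The two indexes have different lengths, so the business-day index
# cannot be attached directly to 365 rows of data.
print(len(daily), len(business))

# Attach the daily index first, then downsample to business days.
df = pd.DataFrame({"hr1": values}, index=daily)
weekdays_df = df.asfreq("B")
print(len(weekdays_df))
```

Each remaining row keeps the value that belonged to that calendar date; the weekend rows are simply dropped.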
To set the frequency to business days while excluding certain dates, you can use offsets.CustomBusinessDay:
import pandas.tseries.offsets as offsets
holidays = ['2013-10-03' , '2013-12-25']
business_days = offsets.CustomBusinessDay(holidays=holidays)
custom_df = df.asfreq(business_days)
As a result, custom_df has two fewer days than indexed_df:
In [12]: len(custom_df)
Out[12]: 259
In [13]: len(indexed_df)
Out[13]: 261
and "holidays" such as '2013-10-03' are missing:
In [18]: '2013-10-03' in indexed_df.index
Out[18]: True
In [19]: '2013-10-03' in custom_df.index
Out[19]: False
It is also useful to know that the reindex method can be used to sub-select rows. For example, you could remove particular days from indexed_df.index:
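# Use .difference for the set difference; in recent pandas versions the `-` operator no longer performs a set difference on indexes.
idx = indexed_df.index.difference(pd.DatetimeIndex(holidays))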
custom_df2 = df.reindex(idx)
As a result, custom_df2 equals custom_df:
In [35]: custom_df2.equals(custom_df)
Out[35]: True
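Since the question also asks how to see or print specific days, here is a small sketch of a few ways to pick rows by date once the frame has a DatetimeIndex. The data is a stand-in for the file contents, and the dates and variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data: one value per calendar day of 2013 (assumption).
daily = pd.date_range("2013-01-01", "2013-12-31", freq="D")
df = pd.DataFrame({"hr1": np.arange(365, dtype=float)}, index=daily)

# A single value for one specific day.
one_value = df.loc[pd.Timestamp("2013-10-03"), "hr1"]
print(one_value)

# A date range; with label-based slicing both endpoints are included.
first_week = df.loc["2013-10-01":"2013-10-07"]
print(first_week)

# An arbitrary set of days, using reindex to sub-select rows.
picked = df.reindex(pd.to_datetime(["2013-03-15", "2013-10-03"]))
print(picked)
```

Note that reindex returns NaN rows for dates that are not in the index, whereas .loc raises a KeyError for a missing label, so which one fits depends on how you want missing days handled.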
But note that the indexes are slightly different:
In [36]: custom_df.index
Out[36]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-12-31]
Length: 259, Freq: C, Timezone: None
In [37]: custom_df2.index
Out[37]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-12-31]
Length: 259, Freq: None, Timezone: None
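custom_df has Freq: C, while custom_df2 has Freq: None. The freq is used by some methods, such as snap and to_period. However, those methods also let you specify the desired frequency as a parameter, so in practice I have not found this difference to be a big deal.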
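As a rough illustration of passing the frequency explicitly, the sketch below builds an index the same way as custom_df2's and calls to_period with a freq argument. The monthly frequency "M" is just an example choice, not something from the original answer.

```python
import pandas as pd

# Business days of 2013 minus the two example holidays.
bdays = pd.date_range("2013-01-01", "2013-12-31", freq="B")
holidays = pd.DatetimeIndex(["2013-10-03", "2013-12-25"])
idx = bdays.difference(holidays)

# The index may carry no freq after the set difference.
print(idx.freq)

# Passing freq explicitly means to_period does not depend on idx.freq.
monthly = idx.to_period(freq="M")
print(monthly[:3])
```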