pandas time series: multiple slices
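I see from the pandas documentation that you can do:
df.loc[['a','b','c'],:]
For a time series, why can't you do:
x = df.loc[['2005-10-27 14:30':'2005-10-27 15:15', '2006-04-14 14:40':'2006-04-14 15:20', '2008-01-25 14:30':'2008-01-25 15:30'],:]
I get a syntax error. Can't you do multiple slice ranges on a time series? Is there a workaround?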
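While DataFrame indexing will accept a list of labels, it will not accept a list of row slice objects. The syntax error itself comes from Python: the start:stop slice notation is only valid directly inside an indexing bracket, not inside a list literal.
This should do what you want. It loops through the ranges you want and compiles a new DataFrame.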
import numpy as np
import pandas as pd

# let's create some fake data
date_range = pd.date_range('2005-01-01', '2008-12-31', freq='9min')
l = len(date_range)
df = pd.DataFrame({'normal': np.random.randn(l), 'uniform': np.random.rand(l),
                   'datetime': date_range, 'integer': range(l)}, index=date_range)

# let's identify the periods we want
desired = [('2005-10-27 14:30', '2005-10-27 15:15'),
           ('2006-04-14 14:40', '2006-04-14 15:20'),
           ('2008-01-25 14:30', '2008-01-25 15:30')]

# let's loop through the desired ranges and compile our selection
# (DataFrame.append was removed in pandas 2.0, so collect the pieces
# in a list and concatenate them once at the end)
selections = []
for (start, stop) in desired:
    selection = df[(df.index >= pd.Timestamp(start)) &
                   (df.index <= pd.Timestamp(stop))]
    selections.append(selection)
x = pd.concat(selections)

# and let's have a look at what we found ...
print(x)
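A shorter variant of the same idea, offered as a sketch rather than as the method above: `.loc` does accept a single label slice on a sorted DatetimeIndex, so each range can be sliced on its own and the pieces joined with `pd.concat`. The helper name `select_ranges` and the small hourly frame are made up for illustration.

```python
import pandas as pd

def select_ranges(df, ranges):
    """Concatenate the rows of df that fall in each (start, stop) range.

    Hypothetical helper; assumes df has a sorted DatetimeIndex.
    Label slices with .loc include both end points.
    """
    pieces = [df.loc[pd.Timestamp(start):pd.Timestamp(stop)]
              for start, stop in ranges]
    return pd.concat(pieces)

# small illustrative frame: one row per hour over two days
idx = pd.date_range('2005-01-01', periods=48, freq='60min')
df_small = pd.DataFrame({'integer': range(48)}, index=idx)

ranges = [('2005-01-01 02:00', '2005-01-01 04:00'),
          ('2005-01-02 10:00', '2005-01-02 11:00')]
x_small = select_ranges(df_small, ranges)
print(x_small)
```

Overlapping ranges would produce duplicated rows with this approach, just as they would in the loop.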
numpy.r_ has been suggested for this, but I couldn't figure out how to make it work with a list of slices, so I used hstack and arange instead.
import numpy as np
import pandas as pd

def loop_version(df, desired):
    # let's loop through the desired ranges and compile our selection
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    selections = []
    for (start, stop) in desired:
        selection = df[(df.index >= pd.Timestamp(start)) &
                       (df.index <= pd.Timestamp(stop))]
        selections.append(selection)
    return pd.concat(selections)

def vectorized_version(df, desired):
    # first flatten the list and parse the strings as timestamps
    times = pd.to_datetime(np.array(desired).flatten())
    # use searchsorted to find the positions of the desired times in
    # df's index; side='right' for the stops keeps the end points
    # inclusive, matching loop_version
    starts = df.index.searchsorted(times[::2], side='left')
    stops = df.index.searchsorted(times[1::2], side='right')
    # use np.arange to convert each (start, stop) pair to a
    # range of positions, similar to np.r_
    ndxlist = np.hstack([np.arange(i1, i2) for i1, i2 in zip(starts, stops)])
    return df.iloc[ndxlist]
In [2]: # let's create some fake data

In [3]: date_range = pd.date_range('2005-01-01', '2008-12-31', freq='9min')

In [4]: l = len(date_range)

In [5]: df = pd.DataFrame({'normal': np.random.randn(l), 'uniform':np.random.rand(l),
   ...:                    'datetime':date_range, 'integer':range(l)}, index=date_range)

In [6]: # let's identify the periods we want
   ...: desired = [('2005-10-27 14:30','2005-10-27 15:15'),
   ...:            ('2006-04-14 14:40','2006-04-14 15:20'),
   ...:            ('2008-01-25 14:30','2008-01-25 15:30')]

In [7]: loop_version(df, desired).equals(vectorized_version(df, desired))
Out[7]: True

In [8]: %timeit loop_version(df, desired)
5.53 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit vectorized_version(df, desired)
308 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
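On the `numpy.r_` point: `np.r_` is indexed rather than called, so a list of slices can be handed to it by converting the list to a tuple first. The following is a sketch under that assumption, not part of the timed code above; `positions_from_pairs` is a hypothetical helper name, and the integer pairs stand in for the positions returned by `searchsorted`.

```python
import numpy as np

def positions_from_pairs(pairs):
    """Expand (start, stop) integer pairs into one flat array of positions.

    Hypothetical helper: indexing np.r_ with a tuple of slice objects is
    equivalent to writing np.r_[a:b, c:d, ...] by hand.
    """
    return np.r_[tuple(slice(i1, i2) for i1, i2 in pairs)]

# stand-in for the (start, stop) positions found by searchsorted
pairs = [(2, 5), (10, 12)]
ndxlist = positions_from_pairs(pairs)
print(ndxlist)
```

The resulting array can be passed to `df.iloc` in the same way as the `hstack`/`arange` result.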