Segmenting a dataset
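Given a CSV dataset containing dates and values, I want to create a new CSV dataset whose rows are the points where the graph changes direction: increasing, decreasing, or not changing at all. Below is a sample of the data, followed by the desired output. (The CSV goes back to 1999.)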
Date Value
07/04/2014 137209.0
04/04/2014 137639.0
03/04/2014 137876.0
02/04/2014 137795.0
01/04/2014 137623.0
31/03/2014 137589.0
28/03/2014 137826.0
27/03/2014 138114.0
26/03/2014 138129.0
25/03/2014 137945.0
The output should be:
StartDate EndDate StartValue EndValue
03/04/2014 07/04/2014 137876 137209
31/03/2014 03/04/2014 137589 137876
27/03/2014 31/03/2014 138114 137589
26/03/2014 27/03/2014 138129 138114
25/03/2014 26/03/2014 137945 138129
My attempt at this involves a self-written Stretch class that manages the splitting of the data as it is added:
from enum import Enum

class Direction(Enum):
    NA = None
    Up = 1
    Stagnant = 0
    Down = -1

    @staticmethod
    def getDir(a, b):
        """Gets two numbers and returns a Direction result by comparing them."""
        if a < b: return Direction.Up
        elif a > b: return Direction.Down
        else: return Direction.Stagnant

class Stretch:
    """Accepts tuples of (insignificant, float). Adds tuples to internal data struct
    while they have the same trend (down, up, stagnant). See add() for details."""

    def __init__(self, dp=None):
        self.data = []
        if dp:
            self.data.append(dp)
        self.dir = Direction.NA

    def add(self, dp):
        """Adds dp to self if it follows a given trend (or it holds less than 2 datapts).
        Returns (True, None) if the datapoint was added to this Stretch instance,
        returns (False, new_stretch) if it broke the trend. The new_stretch
        contains the last value of self.data as well as the new dp."""
        if not self.data:
            self.data.append(dp)
            return True, None
        if len(self.data) == 1:
            # The second point fixes the direction of this stretch.
            self.dir = Direction.getDir(self.data[-1][1], dp[1])
            self.data.append(dp)
            return True, None
        if Direction.getDir(self.data[-1][1], dp[1]) == self.dir:
            self.data.append(dp)
            return True, None
        else:
            # Trend broken: start a new stretch at the previous point.
            k = Stretch(self.data[-1])
            k.add(dp)
            return False, k
Demo file:
with open("d.txt", "w") as w:
    w.write("""Date Value
07/04/2014 137209.0
04/04/2014 137639.0
03/04/2014 137876.0
02/04/2014 137795.0
01/04/2014 137623.0
31/03/2014 137589.0
28/03/2014 137826.0
27/03/2014 138114.0
26/03/2014 138129.0
25/03/2014 137945.0
""")
Usage:
data_stretches = []
with open("d.txt") as r:
    S = Stretch()
    for line in r:
        try:
            date, value = line.strip().split()
            value = float(value)
        except (IndexError, ValueError) as e:
            print("Illegal line: '{}'".format(line))
            continue
        b, newstretch = S.add((date, value))
        if not b:
            data_stretches.append(S)
            S = newstretch
data_stretches.append(S)

for s in data_stretches:
    data = s.data
    direc = s.dir
    print(data[0][0], data[-1][0], data[0][1], data[-1][-1], s.dir)
Output:
# EndDate StartDate EndV StartV (reversed b/c I inverted dates)
07/04/2014 03/04/2014 137209.0 137876.0 Direction.Up
03/04/2014 31/03/2014 137876.0 137589.0 Direction.Down
31/03/2014 26/03/2014 137589.0 138129.0 Direction.Up
26/03/2014 25/03/2014 138129.0 137945.0 Direction.Down
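Apart from the mix-up over which direction the data is evaluated in ("from when to when"), my output differs from yours because you split one uniform sequence into two parts for no apparent reason: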
27/03/2014 31/03/2014 138114 137589 # further down
26/03/2014 27/03/2014 138129 138114 # down
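To sidestep the "from when to when" confusion, the rows can be sorted chronologically before segmenting. The following is only a minimal sketch of that idea, not a replacement for the Stretch class; the helper names sign and segment are made up for illustration, and it assumes dates in DD/MM/YYYY format.

```python
from datetime import datetime

def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def segment(points):
    """Split chronologically ordered (date, value) pairs into stretches.

    Returns a list of (start_date, end_date, start_value, end_value);
    consecutive steps with the same direction are merged into one stretch.
    """
    segments = []
    start = 0
    for i in range(1, len(points)):
        is_last = i == len(points) - 1
        step_in = sign(points[i][1] - points[i - 1][1])
        # A stretch ends at point i if it is the last point or the direction changes.
        if is_last or sign(points[i + 1][1] - points[i][1]) != step_in:
            segments.append((points[start][0], points[i][0],
                             points[start][1], points[i][1]))
            start = i
    return segments

rows = [
    ("07/04/2014", 137209.0), ("04/04/2014", 137639.0), ("03/04/2014", 137876.0),
    ("02/04/2014", 137795.0), ("01/04/2014", 137623.0), ("31/03/2014", 137589.0),
    ("28/03/2014", 137826.0), ("27/03/2014", 138114.0), ("26/03/2014", 138129.0),
    ("25/03/2014", 137945.0),
]

# Sort oldest first so that "start" really is earlier than "end".
chronological = sorted(rows, key=lambda r: datetime.strptime(r[0], "%d/%m/%Y"))
result = segment(chronological)

# Print newest stretch first, like the desired output in the question.
for start_date, end_date, start_value, end_value in reversed(result):
    print(start_date, end_date, start_value, end_value)
```

On the sample data this gives four stretches, with 26/03/2014 to 31/03/2014 kept as a single downward run.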
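You can use sign from numpy and apply it to the diff of the column 'Value' to see where the trend of the graph changes, then create an incremental value per trend group with shift and cumsum: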
import numpy as np

ser_sign = np.sign(df.Value.diff(-1).ffill())
ser_gr = (ser_sign.shift() != ser_sign).cumsum()
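To see what these two series contain, here is a small self-contained sketch on the sample data from the question. Loading the data through io.StringIO with a whitespace separator is an assumption made for the demo; the real file may need a different read_csv call.

```python
import io

import numpy as np
import pandas as pd

raw = """Date Value
07/04/2014 137209.0
04/04/2014 137639.0
03/04/2014 137876.0
02/04/2014 137795.0
01/04/2014 137623.0
31/03/2014 137589.0
28/03/2014 137826.0
27/03/2014 138114.0
26/03/2014 138129.0
25/03/2014 137945.0
"""
# Assumption: whitespace-separated columns, dates kept as plain strings.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# diff(-1) compares each row with the next (older) row; the last row has no
# successor, so ffill() lets it inherit the previous difference.
ser_sign = np.sign(df["Value"].diff(-1).ffill())

# A new group starts wherever the sign differs from the row above.
ser_gr = (ser_sign.shift() != ser_sign).cumsum()

print(pd.DataFrame({"Date": df["Date"], "sign": ser_sign, "group": ser_gr}))
```

For the ten sample rows the group labels come out as 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, that is, one label per run of equal sign.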
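Now that you know the groups, you can get the start and end of each one with groupby on ser_gr combined with first, last and join. The start rows come from grouping on ser_gr after a shift, because the last row of each group is also the first row of the next group.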
df_new = (df.groupby(ser_gr.shift().bfill(), as_index=False).last()
            .join(df.groupby(ser_gr, as_index=False).first(),
                  lsuffix='_start', rsuffix='_end'))
print(df_new)
Date_start Value_start Date_end Value_end
0 03/04/2014 137876.0 07/04/2014 137209.0
1 31/03/2014 137589.0 03/04/2014 137876.0
2 26/03/2014 138129.0 31/03/2014 137589.0
3 25/03/2014 137945.0 26/03/2014 138129.0
Now, if you need to reorder and rename the columns, you can use:
df_new.columns = ['StartDate', 'StartValue', 'EndDate', 'EndValue']
df_new = df_new[['StartDate','EndDate','StartValue','EndValue']]
print(df_new)
StartDate EndDate StartValue EndValue
0 03/04/2014 07/04/2014 137876.0 137209.0
1 31/03/2014 03/04/2014 137589.0 137876.0
2 26/03/2014 31/03/2014 138129.0 137589.0
3 25/03/2014 26/03/2014 137945.0 138129.0
Both operations could also be done at the same time as creating df_new, by using rename.
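As a rough sketch of that single-chain variant, the following builds df_new with the final column names in one expression. It is only one way to do it: the names ser_gr_start and "group" are introduced here for illustration, and it groups with the default index rather than as_index=False so that the group label becomes the join key.

```python
import io

import numpy as np
import pandas as pd

raw = """Date Value
07/04/2014 137209.0
04/04/2014 137639.0
03/04/2014 137876.0
02/04/2014 137795.0
01/04/2014 137623.0
31/03/2014 137589.0
28/03/2014 137826.0
27/03/2014 138114.0
26/03/2014 138129.0
25/03/2014 137945.0
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

ser_sign = np.sign(df["Value"].diff(-1).ffill())
# Rename the key so it cannot be confused with the 'Value' column (a precaution).
ser_gr = (ser_sign.shift() != ser_sign).cumsum().rename("group")
# Shifted key: the last row of each shifted group is the start of the stretch.
ser_gr_start = ser_gr.shift().bfill().astype(int)

df_new = (
    df.groupby(ser_gr_start).last()
    .join(df.groupby(ser_gr).first(), lsuffix="_start", rsuffix="_end")
    .rename(columns={"Date_start": "StartDate", "Value_start": "StartValue",
                     "Date_end": "EndDate", "Value_end": "EndValue"})
    .loc[:, ["StartDate", "EndDate", "StartValue", "EndValue"]]
    .reset_index(drop=True)
)
print(df_new)

# Calling to_csv without a path returns the CSV text instead of writing a file.
csv_text = df_new.to_csv(index=False)
```

Passing a file name to to_csv instead would write the new dataset to disk.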