Date Difference Between Two Device Failures
I am trying to calculate the number of days between failures. For every day in the series, I want to know how many days have passed since the last failure (where failure = 1). There may be anywhere from 1 to 1500 devices.

For example, I would like my dataframe to look like this (please pull the actual data from the url in the second code block; this is just a short sample of a much larger dataframe):
date device failure elapsed
10/01/2015 S1F0KYCR 1 0
10/07/2015 S1F0KYCR 1 7
10/08/2015 S1F0KYCR 0 0
10/09/2015 S1F0KYCR 0 0
10/17/2015 S1F0KYCR 1 11
10/31/2015 S1F0KYCR 0 0
10/01/2015 S8KLM011 1 0
10/02/2015 S8KLM011 1 2
10/07/2015 S8KLM011 0 0
10/09/2015 S8KLM011 0 0
10/11/2015 S8KLM011 0 0
10/21/2015 S8KLM011 1 20
Sample code:

Edit: please pull the actual data from the code block below. The sample data above is just a short example. Thanks.
import pandas as pd

url = "https://raw.githubusercontent.com/dsdaveh/device-failure-analysis/master/device_failure.csv"
df = pd.read_csv(url, encoding="ISO-8859-1")
df = df.sort_values(by=['date', 'device'], ascending=True)  # sort by date and device
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')  # parse date strings to datetime
This is where I run into a roadblock. The new column should contain the number of days since the last failure, where failure = 1:
test['elapsed'] = 0
for i in test.index[1:]:
    if not test['failure'][i]:
        test['elapsed'][i] = test['elapsed'][i-1] + 1
I also tried:
fails = df[df.failure==1]
fails.Dates = trues.index #need this because .diff() won't work on the index..
fails.Elapsed = trues.Dates.diff()
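For reference, the desired column can also be produced with a plain loop. This is a sketch on the short sample from the question, with the data rebuilt inline (illustrative only, not the dataset from the url):

```python
import pandas as pd

# Rebuild the short sample from the question inline.
df = pd.DataFrame({
    'date': ['10/01/2015', '10/07/2015', '10/08/2015', '10/09/2015',
             '10/17/2015', '10/31/2015', '10/01/2015', '10/02/2015',
             '10/07/2015', '10/09/2015', '10/11/2015', '10/21/2015'],
    'device': ['S1F0KYCR'] * 6 + ['S8KLM011'] * 6,
    'failure': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# Remember the last failure date per device; a failure row gets
# (current - previous).days + 1, every other row gets 0.
last_fail = {}
elapsed = []
for _, row in df.iterrows():
    if row['failure'] == 1:
        prev = last_fail.get(row['device'])
        elapsed.append(0 if prev is None else (row['date'] - prev).days + 1)
        last_fail[row['device']] = row['date']
    else:
        elapsed.append(0)
df['elapsed'] = elapsed
```

A loop like this is slow on large frames, but it states the rule explicitly and matches the expected output above.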
Use pandas.DataFrame.groupby with diff and numpy.where:
import pandas as pd
import numpy as np

df['date'] = pd.to_datetime(df['date'])
# Days since the previous row within each (device, failure) group
s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
# Keep the gap only on failure rows; every other row gets 0
df['elapsed'] = np.where(df['failure'], s, 0)
Output:
Date Device Failure Elapsed
0 2015-10-01 S1F0KYCR 1 0.0
1 2015-10-07 S1F0KYCR 1 7.0
2 2015-10-08 S1F0KYCR 0 0.0
3 2015-10-09 S1F0KYCR 0 0.0
4 2015-10-17 S1F0KYCR 1 11.0
5 2015-10-31 S1F0KYCR 0 0.0
6 2015-10-01 S8KLM011 1 0.0
7 2015-10-02 S8KLM011 1 2.0
8 2015-10-07 S8KLM011 0 0.0
9 2015-10-09 S8KLM011 0 0.0
10 2015-10-11 S8KLM011 0 0.0
11 2015-10-21 S8KLM011 1 20.0
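As a self-contained check, the same approach run on the small sample from the question, with the data rebuilt inline:

```python
import pandas as pd
import numpy as np

# Same small sample as in the question, rebuilt inline.
df = pd.DataFrame({
    'date': pd.to_datetime(
        ['2015-10-01', '2015-10-07', '2015-10-08', '2015-10-09',
         '2015-10-17', '2015-10-31', '2015-10-01', '2015-10-02',
         '2015-10-07', '2015-10-09', '2015-10-11', '2015-10-21']),
    'device': ['S1F0KYCR'] * 6 + ['S8KLM011'] * 6,
    'failure': [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
})

# diff() within each (device, failure) group gives the gap to the
# previous row of the same kind; add(1) matches the expected counts.
s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1).fillna(0)
df['elapsed'] = np.where(df['failure'], s, 0)
```

Grouping on both device and failure means the diff for a failure row is always taken against the previous failure row of the same device, which is exactly the "days since last failure" definition.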
Update:
It turns out the actual data linked in the OP contains no device with more than one failure, which makes the final result all zeros (i.e., a second failure never occurs, so there is no elapsed to compute). Using the OP's original snippet:
import pandas as pd
url = "http://aws-proserve-data-science.s3.amazonaws.com/device_failure.csv"
df = pd.read_csv(url, encoding = "ISO-8859-1")
df = df.sort_values(by = ['date', 'device'], ascending = True)
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d')
Check whether any device has more than one failure:
df.groupby(['device'])['failure'].sum().gt(1).any()
# False
This confirms that the all-zeros in df['elapsed'] is actually the correct answer :)
If you tweak the data a little, it does produce elapsed as expected:
df.loc[6879, 'device'] = 'S1F0RRB1'
# create a second failure occurrence for device S1F0RRB1
s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
df['elapsed'] = np.where(df['failure'], s, 0)
df['elapsed'].value_counts()
# 0.0 124493
# 3.0 1
Here is one way:
df['elapsed'] = df[df.Failure.astype(bool)].groupby('Device').Date.diff().dt.days.add(1)
df['elapsed'] = df['elapsed'].fillna(0)
df
Out[225]:
Date Device Failure Elapsed elapsed
0 2015-10-01 S1F0KYCR 1 0 0.0
1 2015-10-07 S1F0KYCR 1 7 7.0
2 2015-10-08 S1F0KYCR 0 0 0.0
3 2015-10-09 S1F0KYCR 0 0 0.0
4 2015-10-17 S1F0KYCR 1 11 11.0
5 2015-10-31 S1F0KYCR 0 0 0.0
6 2015-10-01 S8KLM011 1 0 0.0
7 2015-10-02 S8KLM011 1 2 2.0
8 2015-10-07 S8KLM011 0 0 0.0
9 2015-10-09 S8KLM011 0 0 0.0
10 2015-10-11 S8KLM011 0 0 0.0
11 2015-10-21 S8KLM011 1 20 20.0
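The one-liner above can be checked on one device from the sample, with the capitalized column names used in this answer and the data rebuilt inline:

```python
import pandas as pd

# One device from the sample, using this answer's capitalized column names.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-10-01', '2015-10-07', '2015-10-08',
                            '2015-10-09', '2015-10-17', '2015-10-31']),
    'Device': ['S1F0KYCR'] * 6,
    'Failure': [1, 1, 0, 0, 1, 0],
})

# Diff only the failure rows per device; non-failure rows come back
# as NaN after the index-aligned assignment and are filled with 0.
df['elapsed'] = df[df.Failure.astype(bool)].groupby('Device').Date.diff().dt.days.add(1)
df['elapsed'] = df['elapsed'].fillna(0)
```

Filtering to failure rows before the groupby means non-failure rows never enter the diff at all, so they need no masking afterwards, only the fillna.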