Calculation of failure rate and date-time manipulation in R
I have a sample data frame that I am working with:
Datetime <- c("2015-09-29 08:22:00", "2015-09-29 09:45:00", "2015-09-29 09:53:00", "2015-09-29 10:22:00", "2015-09-29 10:42:00",
"2015-09-29 11:31:00", "2015-09-29 11:47:00", "2015-09-29 12:45:00", "2015-09-29 13:11:00", "2015-09-29 13:44:00",
"2015-09-29 15:24:00", "2015-09-29 16:28:00", "2015-09-29 20:22:00", "2015-09-29 21:38:00", "2015-09-29 23:34:00")
Measurement <- c("Length","Length","Width","Height","Width","Height","Length","Width","Width","Height","Width","Length",
"Length","Height","Height")
PASSFAIL <- c("PASS","PASS","FAIL","PASS","PASS","FAIL_AVG_HIGH","FAIL#Pts","FAIL","FAIL_AVG_LOW","FAIL","PASS","PASS","FAIL#RNG#HIGH","PASS","FAIL")
df1 <- data.frame(Datetime,Measurement,PASSFAIL)
df1
              Datetime Measurement      PASSFAIL
1  2015-09-29 08:22:00      Length          PASS
2  2015-09-29 09:45:00      Length          PASS
3  2015-09-29 09:53:00       Width          FAIL
4  2015-09-29 10:22:00      Height          PASS
5  2015-09-29 10:42:00       Width          PASS
6  2015-09-29 11:31:00      Height FAIL_AVG_HIGH
7  2015-09-29 11:47:00      Length      FAIL#Pts
8  2015-09-29 12:45:00       Width          FAIL
9  2015-09-29 13:11:00       Width  FAIL_AVG_LOW
10 2015-09-29 13:44:00      Height          FAIL
11 2015-09-29 15:24:00       Width          PASS
12 2015-09-29 16:28:00      Length          PASS
13 2015-09-29 20:22:00      Length FAIL#RNG#HIGH
14 2015-09-29 21:38:00      Height          PASS
15 2015-09-29 23:34:00      Height          FAIL
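I am working on an interesting problem: finding the failure rate of each measurement type in the two halves of a day, 12 AM-12 PM and 12 PM-12 AM (the next day).
Note: in df1, any row whose PASSFAIL value contains FAIL is counted as a failure.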
Fail Rate = (Number of Fails)/(Number of Fails + Number of Pass)
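As a quick check of this definition, here is a minimal base R sketch on a made-up vector of four outcomes (the vector `status` is hypothetical and is not part of df1). It treats any label containing "FAIL" as a failure, as described in the note above.

```r
# Hypothetical mini-example: four outcomes in the same style as PASSFAIL
status <- c("PASS", "FAIL", "FAIL_AVG_HIGH", "PASS")

# A row counts as a failure when its label contains "FAIL"
is_fail <- grepl("FAIL", status, fixed = TRUE)

n_fail <- sum(is_fail)
n_pass <- sum(!is_fail)

# Fail Rate = (Number of Fails) / (Number of Fails + Number of Pass)
fail_rate <- n_fail / (n_fail + n_pass)
fail_rate
```

Because every label in df1 is either exactly "PASS" or contains "FAIL", testing `status != "PASS"` would give the same result on this data.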
My desired output looks like this:
                Datetime FailRate_length Total_length FailRate_Width Total_Width FailRate_Height Total_Height
1 2015-09-29 00:00:00 AM            0.33            3           0.50           2            0.50            2
2 2015-09-29 12:00:00 PM            0.50            2           0.66           3            0.66            3
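I am trying to solve this with the dplyr and data.table packages, but I cannot figure out how to split the time intervals in df1 to get a df2 with 2 rows: 12 AM (the first 7 observations of df1) and 12 PM (the next 8 observations of df1). Can someone help me?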
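Using data.table...
library(data.table)
# convert df1 to a data.table by reference first; := does not work on a plain data.frame
setDT(df1)
# thanks to @DavidArenburg for suggesting this approach:
df1[, `:=`(
  d        = as.IDate(Datetime),
  antepost = c("am","pm")[1 + (hour(Datetime) >= 12)]
)]
# failure rate and row count per date, half-day and measurement
res <- df1[, .(
  failrate = sum(PASSFAIL != "PASS")/.N,
  N        = .N
), by = .(d, antepost, Measurement)]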
which gives
            d antepost Measurement  failrate N
1: 2015-09-29       am      Length 0.3333333 3
2: 2015-09-29       am       Width 0.5000000 2
3: 2015-09-29       am      Height 0.5000000 2
4: 2015-09-29       pm       Width 0.6666667 3
5: 2015-09-29       pm      Height 0.6666667 3
6: 2015-09-29       pm      Length 0.5000000 2
The syntax is DT[i, j, by], where by groups by variables and j operates on columns. Inside j, := creates new columns.
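To make the three slots concrete, here is a small sketch on a toy table. The table `DT` and its columns `grp`, `x` and `x2` are hypothetical and unrelated to df1; only the data.table package is assumed.

```r
library(data.table)

# Hypothetical toy table, unrelated to df1
DT <- data.table(grp = c("a", "a", "b"), x = c(1, 2, 10))

# := inside j adds a column by reference, so no reassignment is needed
DT[, x2 := x * 2]

# i filters rows, j computes summaries, by defines the groups
out <- DT[x > 1, .(total = sum(x2), n = .N), by = grp]
out
```

Here the filter in i keeps two rows, one per group, and j is then evaluated once for each value of `grp`.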
Reshaping to the OP's desired output...
dcast(res, d + antepost ~ Measurement, value.var = c("failrate", "N"))
which gives
            d antepost failrate_Height failrate_Length failrate_Width N_Height N_Length N_Width
1: 2015-09-29       am       0.5000000       0.3333333      0.5000000        2        3       2
2: 2015-09-29       pm       0.6666667       0.5000000      0.6666667        3        2       3
Thanks to @Arun, here is a way to do everything in one step:
dcast(setDT(df1),
  as.IDate(Datetime) + c("am","pm")[1 + (hour(Datetime) >= 12)] ~ Measurement,
  value.var = "PASSFAIL",
  fun.aggregate = list(function(x) sum(x != "PASS")/length(x), length)
)
which gives
     Datetime Datetime_1 PASSFAIL_function_Height PASSFAIL_function_Length PASSFAIL_function_Width PASSFAIL_length_Height PASSFAIL_length_Length PASSFAIL_length_Width
1: 2015-09-29         am                0.5000000                0.3333333               0.5000000                      2                      3                     2
2: 2015-09-29         pm                0.6666667                0.5000000               0.6666667                      3                      2                     3
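The column names are generated automatically from the root variables in the ~ part and from the first word of each function's definition.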
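If the generated names are unwieldy, they can be changed afterwards with `setnames()`. The sketch below uses a hand-built stand-in called `wide` (hypothetical, containing only a few of the columns shown above), and the target names `antepost`, `FailRate_*` and `Total_*` are my own choice. The exact auto-generated names may differ between data.table versions, so the patterns may need adjusting.

```r
library(data.table)

# Hypothetical stand-in for the one-step dcast result (subset of its columns)
wide <- data.table(
  Datetime                 = as.IDate("2015-09-29"),
  Datetime_1               = c("am", "pm"),
  PASSFAIL_function_Height = c(0.5, 2/3),
  PASSFAIL_length_Height   = c(2L, 3L)
)

# Build the new names from the old ones with simple prefix substitutions
old <- copy(names(wide))
new <- sub("^PASSFAIL_function_", "FailRate_", old)
new <- sub("^PASSFAIL_length_", "Total_", new)
new[new == "Datetime_1"] <- "antepost"

# Rename by reference
setnames(wide, old, new)
names(wide)
```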
A dplyr + tidyr equivalent (the binning is slightly different, though the one above is elegant):
# load plyr before dplyr so that dplyr's verbs are not masked; plyr supplies round_any()
library(plyr)
library(dplyr)
library(tidyr)
df1 %>%
  mutate(
    half_day =
      Datetime %>%
      as.POSIXct(tz = "UTC") %>%
      round_any(60*60*12, f = floor) ) %>%   # floor each timestamp to a 12-hour boundary
  group_by(half_day, Measurement) %>%
  summarize(Total = n(),
            FailRate = sum(PASSFAIL != "PASS")/Total) %>%
  gather(variable, value, FailRate, Total) %>%
  unite(variable_new, variable, Measurement, sep = "_") %>%
  spread(variable_new, value)
The gather, unite, spread sequence is the tidyr equivalent of dcast. Note that
1 half day * (12 hours / half day) * (60 min / hour) * (60 seconds / min) = 60*60*12 seconds
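The flooring step can be reproduced in base R without plyr, which may make the arithmetic easier to follow. This is only a sketch on two timestamps taken from df1. It assumes the times are interpreted as UTC, so that multiples of 43200 seconds since the epoch fall exactly on 00:00 and 12:00.

```r
# Two timestamps from df1, parsed as UTC (an assumption, matching the pipeline above)
x <- as.POSIXct(c("2015-09-29 08:22:00", "2015-09-29 13:11:00"), tz = "UTC")

half_day_secs <- 60 * 60 * 12   # 43200 seconds in half a day

# Floor seconds-since-epoch to a multiple of 43200, then convert back to a date-time
floored <- as.POSIXct(floor(as.numeric(x) / half_day_secs) * half_day_secs,
                      origin = "1970-01-01", tz = "UTC")

format(floored, "%Y-%m-%d %H:%M:%S")
# "2015-09-29 00:00:00" "2015-09-29 12:00:00"
```

In a time zone whose UTC offset is not a multiple of 12 hours, the same arithmetic would put the bin edges at other local clock times, so the choice of `tz` matters.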