R data.frame: create a conditional column
I have the data below. I want a new column called accuracy_level. How can I do this? I tried if, but it did not work.
The rules are:
If accuracy_percentage is within +/-10%, accuracy_level should be "Good".
If accuracy_percentage is within +/-30% but outside +/-10%, accuracy_level should be "Bad".
If accuracy_percentage is outside +/-30%, accuracy_level should be "Worst".
Here is my code:
actuals=seq(0,10,0.1)
forecast=seq(10,0,-0.1)
data1=data.frame(actuals,forecast)
data1$diff=data1$actuals-data1$forecast
data1$accuracy_percentage=(data1$diff/data1$actuals)*100
if((data1$accuracy_percentage < 10)&(data1$accuracy_percentage > -10),data1$accuracy_level="good",)
I used a nested ifelse:
data1$accuracy_category <- ifelse(abs(data1$accuracy_percentage)<10, "Good",
ifelse(abs(data1$accuracy_percentage)<30, "Bad", "Worst"))
Output:
> head(data1)
actuals forecast diff accuracy_percentage accuracy_category
1 0.0 10.0 -10.0 -Inf Worst
2 0.1 9.9 -9.8 -9800.000 Worst
3 0.2 9.8 -9.6 -4800.000 Worst
4 0.3 9.7 -9.4 -3133.333 Worst
5 0.4 9.6 -9.2 -2300.000 Worst
6 0.5 9.5 -9.0 -1800.000 Worst
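The if attempt in the question fails because if expects a single TRUE/FALSE condition and is written as if (cond) expr, whereas the column holds one value per row. As a minimal sketch of the same vectorized idea without nesting, the labels can also be assigned with logical indexing. The helper vector pct is an illustrative name that is not in the original, and the data is rebuilt here only so the block runs on its own.

```r
# Rebuild the question's data so the sketch is self-contained.
actuals  <- seq(0, 10, 0.1)
forecast <- seq(10, 0, -0.1)
data1 <- data.frame(actuals, forecast)
data1$accuracy_percentage <- (data1$actuals - data1$forecast) / data1$actuals * 100

# pct is a hypothetical helper: abs() folds the +/- bounds into one comparison.
pct <- abs(data1$accuracy_percentage)

# Start with the fallback label, then overwrite the narrower bands.
data1$accuracy_level <- "Worst"
data1$accuracy_level[pct < 30] <- "Bad"
data1$accuracy_level[pct < 10] <- "Good"

# Count how many rows fall into each band.
table(data1$accuracy_level)
```

On this data the result matches the nested ifelse, because both use the same strict "<" comparisons.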
As @pierre-lafortune pointed out, the nested ifelse is easier to read but less performant. In the spirit of Knuth, I ran some tests. With your initial setup:
> system.time(data1$accuracy_category <- ifelse(abs(data1$accuracy_percentage)<10, "Good",
+ ifelse(abs(data1$accuracy_percentage)<30, "Bad", "Worst")))
user system elapsed
0 0 0
> system.time(data1$accuracy_level <- cut(abs(data1$accuracy_percentage), c(0, 10, 30, Inf), c("Good", "Bad", "Worst"), include.lowest=T))
user system elapsed
0.000 0.000 0.001
But that doesn't tell us much. So let's crank it up :) With
actuals=seq(0,100000,0.1)
forecast=seq(100000,0,-0.1)
I got
> system.time(data1$accuracy_category <- ifelse(abs(data1$accuracy_percentage)<10, "Good",
+ ifelse(abs(data1$accuracy_percentage)<30, "Bad", "Worst")))
user system elapsed
0.776 0.060 0.840
> system.time(data1$accuracy_level <- cut(abs(data1$accuracy_percentage), c(0, 10, 30, Inf), c("Good", "Bad", "Worst"), include.lowest=T))
user system elapsed
0.152 0.003 0.155
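This does show that cut is more performant as you scale up. That said, cut is also the more elegant option, even if it is somewhat less readable, and I upvoted his answer :) YMMV.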
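The larger test above only shows the new actuals and forecast vectors, so here is a sketch of the full setup that can be rerun locally. It assumes the data frame and percentage column are rebuilt the same way as in the question. The names pct, t_ifelse, t_cut, by_ifelse and by_cut are illustrative, and the timings will differ from machine to machine.

```r
# Larger inputs, rebuilt the same way as in the question (assumed setup).
actuals  <- seq(0, 100000, 0.1)
forecast <- seq(100000, 0, -0.1)
data1 <- data.frame(actuals, forecast)
data1$accuracy_percentage <- (data1$actuals - data1$forecast) / data1$actuals * 100

# Compute the absolute value once so both timings measure only the labelling step.
pct <- abs(data1$accuracy_percentage)

t_ifelse <- system.time(
  by_ifelse <- ifelse(pct < 10, "Good", ifelse(pct < 30, "Bad", "Worst"))
)
t_cut <- system.time(
  by_cut <- cut(pct, c(0, 10, 30, Inf), c("Good", "Bad", "Worst"),
                include.lowest = TRUE)
)

print(t_ifelse)
print(t_cut)
```

A single system.time call is noisy, so repeating each measurement a few times gives a fairer comparison.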
data1$accuracy_level <- cut(abs(data1$accuracy_percentage), c(0, 10, 30, Inf), c("Good", "Bad", "Worst"), include.lowest=T)
# actuals forecast diff accuracy_percentage accuracy_level
# 19 1.8 8.2 -6.4 -355.55556 Worst
# 71 7.0 3.0 4.0 57.14286 Worst
# 57 5.6 4.4 1.2 21.42857 Bad
# 17 1.6 8.4 -6.8 -425.00000 Worst
# 92 9.1 0.9 8.2 90.10989 Worst
# 91 9.0 1.0 8.0 88.88889 Worst
# 13 1.2 8.8 -7.6 -633.33333 Worst
# 79 7.8 2.2 5.6 71.79487 Worst
# 44 4.3 5.7 -1.4 -32.55814 Worst
# 51 5.0 5.0 0.0 0.00000 Good
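Using cut improves speed and scalability. We bin the absolute value (abs) of the accuracy percentage into intervals defined by the cut points c(0, 10, 30, Inf) and supply a label for each group. We also add the argument include.lowest=TRUE so that values of 0.000, which fall exactly on the lowest cut point, are included.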
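One detail worth checking is how values that land exactly on a cut point are labelled. By default cut uses right-closed intervals, so a value of exactly 10 is "Good", while the nested ifelse with "<" would call it "Bad". The sketch below uses made-up values in pct (not from the original data) to show both behaviours. Passing right = FALSE is one way to make cut agree with the "<" comparisons.

```r
# Hypothetical values chosen to sit on and around the cut points.
pct <- c(0, 5, 10, 21.4, 30, 57.1, Inf)

# Default (right = TRUE): bins are [0,10], (10,30], (30,Inf].
closed_right <- cut(pct, c(0, 10, 30, Inf), c("Good", "Bad", "Worst"),
                    include.lowest = TRUE)

# right = FALSE: bins are [0,10), [10,30), [30,Inf], which matches the
# strict "<" tests used by the nested ifelse.
closed_left <- cut(pct, c(0, 10, 30, Inf), c("Good", "Bad", "Worst"),
                   include.lowest = TRUE, right = FALSE)

# Side-by-side view of the two labellings.
data.frame(pct, closed_right, closed_left)
```

Which convention is right depends on whether "within +/-10%" is meant to include exactly 10%.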
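Nested ifelse statements are popular because they are easy to understand when read aloud. But if you have to nest 10 different conditions, they can easily get out of hand.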
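Note that if we don't need new label names, we can use the related function findInterval, which does essentially the same thing but returns integer values as output (i.e. 1 2 3 4 ...).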
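As a small sketch of the findInterval variant, using made-up values in pct: the function returns the index of the interval each value falls into, and that index can optionally be used to look up a label. Its intervals are left-closed by default, so exactly 10 lands in the second interval. The names idx and labelled are illustrative.

```r
# Hypothetical values chosen to sit on and around the break points.
pct <- c(0, 5, 10, 21.4, 30, 57.1, Inf)

# 1 for [0,10), 2 for [10,30), 3 for 30 and above.
idx <- findInterval(pct, c(0, 10, 30))

# Optional: map the integer codes back to labels by indexing a character vector.
labelled <- c("Good", "Bad", "Worst")[idx]

data.frame(pct, idx, labelled)
```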