How to count the occurrences of a variable each time it occurs and remove outliers in R
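I have a vector, and I want to remove factor values that seem to be classified incorrectly. For example, the "D" at position 7: since it is surrounded by "A", it should also be "A". I know there has to be a rule, e.g. if the 3 values before and after the outlier are different from it, it is converted (in this case "D" to "A"), otherwise it is deleted, like the "C" at position 22.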
Var = c("A", "A", "A", "A","A", "A", "D", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "B", "B", "C","C","C","C","C","C","C","C","C","C","D", "D","D","D","D","D","D","D", "A", "A", "A", "A","A", "A", "A", "A", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C","C","C","C","C","C","C","C", "C","C","C","C","C","C","C","C", "D","D","D","D","D")
Var= as.factor(Var)
Var2=c("1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","1",
"1","1","1","1","1")
df<- data.frame (Var, Var2)
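In addition, I want to count the occurrences of each value every time it occurs. So I do not want the counts over the whole vector, but a list like the one below, ideally using the corrected values.
# Var Occurrence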
#1 A 6
#2 D 1
#3 A 4
#4 B 10
#5 C 1
#6 B 2 ...
I can only count the values over the whole vector:
table (Var)
With the code below I get a column that restarts counting every time "Var" changes.
# write the counter to a new column (Count is an illustrative name) so that Var itself is not overwritten
df$Count <- sequence(rle(as.character(df$Var))$lengths)
This may be easier with data.table. Group by the rleid (run-length id) of 'Var' and get the count (.N), then remove the outliers with a logical expression in the final [ ] subset (based on the boxplot outliers).
library(data.table)
setDT(df)[, .N, .(Var, grp = rleid(Var))][, grp := NULL][
!N %in% boxplot(N, plot = FALSE)$out]
-output
Var N
1: A 6
2: D 1
3: A 4
4: B 10
5: C 1
6: B 2
7: C 10
8: D 8
9: A 12
10: B 12
11: C 16
12: D 5
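All 12 runs survive the filter in the output above, which suggests the boxplot rule flags none of the run lengths for this data. As a cross-check, here is a minimal base R sketch that rebuilds the vector from the run lengths shown in the output, counts the runs with rle, and applies the same boxplot criterion via boxplot.stats. The names vals, lens, runs and runs_kept are illustrative only and not part of the original answer.

```r
# Rebuild the example vector from the run lengths shown in the output above
vals <- c("A", "D", "A", "B", "C", "B", "C", "D", "A", "B", "C", "D")
lens <- c(6, 1, 4, 10, 1, 2, 10, 8, 12, 12, 16, 5)
x <- rep(vals, times = lens)

# One row per run of identical consecutive values
r <- rle(x)
runs <- data.frame(Var = r$values, N = r$lengths, stringsAsFactors = FALSE)

# Run lengths lying more than 1.5 * IQR beyond the hinges,
# the same default criterion boxplot() uses for its $out component
out <- boxplot.stats(runs$N)$out

# Drop runs whose length is flagged; here nothing is flagged,
# so runs_kept has the same 12 rows as runs
runs_kept <- runs[!runs$N %in% out, ]
print(runs_kept)
```

Because the filter works on run lengths only, it does not relabel a value such as the single "D" inside the "A" block; it can only drop whole runs with an unusual length.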
rleid can take multiple input columns because its first argument is variadic (...), from ?rleid:
rleid(..., prefix=NULL)
... A sequence of numeric, integer64, character or logical vectors, all of same length. For interactive use.
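To make the effect of several inputs concrete, the following small sketch (toy vectors a and b, which are not part of the question's data) compares rleid on one vector with rleid on two. With two inputs, a new id should start whenever either of them changes.

```r
library(data.table)

# Toy vectors, assumed for illustration only
a <- c("x", "x", "x", "y", "y")
b <- c(1, 1, 2, 2, 2)

id_a  <- rleid(a)      # 1 1 1 2 2  (new id when a changes)
id_ab <- rleid(a, b)   # 1 1 2 3 3  (new id when a or b changes)

print(id_a)
print(id_ab)
```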
So, if we have multiple columns, we can either specify the columns or use rleidv with a subset of the data.frame/data.table as input:
setDT(df)[, .N, .(Var, Var2, grp = rleid(Var, Var2))][,
grp := NULL][ !N %in% boxplot(N, plot = FALSE)$out]
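The code above counts runs but does not relabel values such as the "D" at position 7. One possible reading of the rule in the question can be sketched in base R as follows. This is an assumption, not the answerer's method: a value that forms a run of length 1 is replaced when the k = 3 values before it and the k = 3 values after it are all the same single value, and is dropped otherwise. The helper name fix_isolated and the choice to leave values near the ends of the vector untouched are hypothetical.

```r
# Hypothetical helper (not from the original answer): relabel or drop
# isolated values, i.e. values that form a run of length 1.
fix_isolated <- function(x, k = 3) {
  x <- as.character(x)
  n <- length(x)
  r <- rle(x)
  ends <- cumsum(r$lengths)          # last position of every run
  single <- ends[r$lengths == 1]     # positions of runs of length 1
  keep <- rep(TRUE, n)
  fixed <- x
  for (i in single) {
    # Assumption: too close to either end, leave the value untouched
    if (i - k < 1 || i + k > n) next
    nb <- x[c((i - k):(i - 1), (i + 1):(i + k))]  # k neighbours on each side
    if (length(unique(nb)) == 1) {
      fixed[i] <- nb[1]              # neighbours agree: relabel
    } else {
      keep[i] <- FALSE               # neighbours disagree: drop
    }
  }
  fixed[keep]
}

# First 34 values of the question's Var
x <- c(rep("A", 6), "D", rep("A", 4), rep("B", 10), "C", "B", "B", rep("C", 10))
fixed <- fix_isolated(x)
res <- rle(fixed)
data.frame(Var = res$values, N = res$lengths)
# runs after correction: A 11, B 12, C 10
```

On this slice the "D" at position 7 becomes "A" and the "C" at position 22 is removed, because its following three values are "B", "B", "C". The run table can then be computed on the corrected vector, for example with the rleid approach shown earlier.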