How to count the occurrences of a variable each time it occurs and remove outliers in R

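I have a vector. On the one hand, I want to remove factor values that seem to be incorrectly classified, e.g. the "D" at position 7. Since it is surrounded by "A", it should be "A" as well. I know there has to be a rule, for example: if the 3 values before and after the outlier are different from it, the outlier is converged to them (in this case "D" to "A"); otherwise it is removed, like the "C" at position 22.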

Var = c("A", "A", "A", "A","A", "A", "D", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C", "B", "B", "C","C","C","C","C","C","C","C","C","C","D", "D","D","D","D","D","D","D", "A", "A", "A", "A","A", "A", "A", "A", "A", "A","A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "C","C","C","C","C","C","C","C", "C","C","C","C","C","C","C","C", "D","D","D","D","D")

Var= as.factor(Var)



   Var2=c("1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1",
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", 
"1", "2", "1", "2","3", "2", "1", "2", "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1",  "2","2", "2", "1", "1", "2","2", "2", "1", "1","2","2", "2", "1", "1", "2","2", "2", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1", "2","2", "2", "1", "1","1",
 "1","1","1","1","1")

df<- data.frame (Var, Var2)

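On the other hand, I want to count the occurrences of each value each time it occurs. So I do not want the count over the whole vector, but a list like the one below, ideally computed on the corrected values.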

#   Var Occurrence
#1  A 6
#2  D 1
#3  A 4
#4  B 10
#5  C 1
#6  B 2 ...

I am only able to count the values over the whole vector:

table (Var)

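With the code below I get a column that restarts counting every time "Var" changes.

# Running count within each run of 'Var'. The result is stored in a new column
# ('Count'), because ave() on a factor would turn the counts into NA and
# assigning to df$Var would overwrite the original values.
df$Count <- sequence(rle(as.character(df$Var))$lengths)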

It may be easier with data.table. Group by the rleid (run-length id) of 'Var' and get the count (.N), then remove the outliers by creating a logical expression in `i` (based on the boxplot outliers).

library(data.table)
setDT(df)[, .N, .(Var, grp = rleid(Var))][, grp := NULL][
   !N %in% boxplot(N, plot = FALSE)$out]

-output

    Var  N
 1:   A  6
 2:   D  1
 3:   A  4
 4:   B 10
 5:   C  1
 6:   B  2
 7:   C 10
 8:   D  8
 9:   A 12
10:   B 12
11:   C 16
12:   D  5
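
The output above counts runs on the uncorrected data, so the single "D" and "C" still appear as runs of their own. As a rough base R sketch of the correction rule described in the question, followed by the same run count with rle(): the helper name fix_singletons, its parameter k, and the toy vector are assumptions made for illustration and are not from the original post. It also assumes that only runs of length 1 count as outliers and that a single value at either end of the vector is dropped.

```r
# Hypothetical helper (not from the original post): handle runs of length 1.
# A single value is replaced by its neighbours' value when both neighbouring
# runs hold the same value and are at least k long; otherwise it is dropped.
fix_singletons <- function(x, k = 3) {
  r <- rle(as.character(x))
  n <- length(r$values)
  vals <- r$values
  keep <- rep(TRUE, n)
  for (i in which(r$lengths == 1)) {
    surrounded <- i > 1 && i < n &&
      r$values[i - 1] == r$values[i + 1] &&
      r$lengths[i - 1] >= k && r$lengths[i + 1] >= k
    if (surrounded) {
      vals[i] <- r$values[i - 1]   # converge to the surrounding value
    } else {
      keep[i] <- FALSE             # not clearly surrounded: remove it
    }
  }
  rep(vals[keep], r$lengths[keep])
}

# Toy vector: "D" sits between two runs of three "A"; "C" has only two "B" after it
toy <- c("A", "A", "A", "D", "A", "A", "A", "B", "B", "B", "C", "B", "B")
fixed <- fix_singletons(toy)

# Count each run of the corrected vector
runs <- rle(fixed)
counts <- data.frame(Var = runs$values, Occurrence = runs$lengths)
counts
```

On the toy vector the "D" becomes "A" and the "C" is dropped, so counts has two rows: A with 7 and B with 5. Whether k = 3 and the treatment of edge values match the intended rule is a judgment call for the asker.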

rleid can take multiple input columns, because its first argument is variadic (...) - from ?rleid

rleid(..., prefix=NULL)

... A sequence of numeric, integer64, character or logical vectors, all of same length. For interactive use.

So, if we have multiple columns, either specify the columns, or use rleidv with a subset of the data.frame/data.table as input

setDT(df)[, .N, .(Var,  Var2, grp = rleid(Var, Var2))][,
    grp := NULL][ !N %in% boxplot(N, plot = FALSE)$out]
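
The rleidv variant mentioned above is not shown in the code, so here is a minimal sketch of it. The toy table dt is an assumption made for illustration (it is not the data from the post); rleidv takes the table and a cols argument naming the columns that define a run.

```r
library(data.table)

# Toy data (hypothetical, not the data from the post)
dt <- data.table(
  Var  = c("A", "A", "A", "B", "B", "A"),
  Var2 = c("1", "1", "2", "2", "2", "2")
)

# Run id over both columns, taking the columns from the table itself
dt[, grp := rleidv(dt, cols = c("Var", "Var2"))]

# Count per run, then drop the helper column
res <- dt[, .N, by = .(Var, Var2, grp)][, grp := NULL]
res
```

This should give the same grouping as rleid(Var, Var2); rleidv is mainly convenient when the column names are held in a character vector.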