Issues when replacing outliers with the mean in R

I have an HR data frame containing information about an organization's employees, such as salary, department, ID, and so on.

What I want to do is replace the outliers (USD > 200000) in the "salary_2018" column for the "Sales" department with the mean of that column.

This is part of a professional course I am taking. I was given the data frame and the code, which are:

library(readxl)
df <- read_excel("C:/Media Mean Mode.xlsx")  # forward slash: "\M" is not a valid escape in an R string
df1<-df[df$department=="Sales",]
df2 = df1
df2[df2$salary_2018<200000,]<-mean(df2$salary_2018)

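In the video I am following, the instructor uses exactly the same data frame with the same code, and it works. However, when I try exactly the same thing, I get the following error: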

Errore: Assigned data `mean(df2$salary_2018)` must be compatible with existing data.
i Error occurred for column `department`.
x Can't convert <double> to <character>.

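I would understand the error if I were trying to replace values in the "department" column, since its data type is "character".

But given that I am working with "salary_2018", which is "double", why does the error refer to "department"?

Do you know why this happens?

Thanks!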

Edit: as Peter suggested, I have added the structure of the data frame below.

> dput(head(df, 5))

structure(list(age = c(41, 49, 37, 33, 27), department = c("Sales", 
"Research & Development", "Research & Development", "Research & Development", 
"Research & Development"), employee_number = c(1, 2, 4, 5, 7), 
    gender = c("Female", "Male", "Male", "Female", "Male"), job_level = c(2, 
    2, 1, 1, 1), marital_status = c("Single", "Married", "Single", 
    "Married", "Married"), over_time = c("Yes", "No", "Yes", 
    "Yes", "No"), performance_rating = c(3, 4, 3, 3, 3), totalW_working_years = c(8, 
    10, 7, 8, 6), training_times_last_year = c(0, 3, 3, 3, 3), 
    years_since_last_promotion = c(0, 1, 0, 3, 2), years_with_curr_manager = c(5, 
    7, 0, 0, 2), monthly_income = c(5993, 5130, 2090, 2909, 3468
    ), salary_2017 = c(71916, 61560, 25080, 34908, 41616), salary_2018 = c(79826.76, 
    75718.8, 28842, 38747.88, 46609.92), year_of_joining = c(2012, 
    2008, 2018, 2010, 2016), last_role_change = c(2014, 2011, 
    2018, 2011, 2016), percent_hike = c(11, 23, 15, 11, 12)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

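Even without the actual data, your code tries to replace all columns with the salary mean for rows where the salary is below 200k (shouldn't that be above?). This happens because you did not specify a column after the comma, and leaving that position empty means all columns. Note the difference in this code: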

# all columns
mtcars[1:4, ]
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

# columns one, two and three
mtcars[1:4, 1:3]
#>                 mpg cyl disp
#> Mazda RX4      21.0   6  160
#> Mazda RX4 Wag  21.0   6  160
#> Datsun 710     22.8   4  108
#> Hornet 4 Drive 21.4   6  258
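
The same rule applies to assignment, which is where the error in the question comes from. Here is a minimal sketch on made-up data (the `toy` tibble and its values are hypothetical, not the asker's spreadsheet), assuming a recent version of the tibble package. With the column position left empty, the numeric mean is also pushed into the character column `department`, and tibble is expected to reject that:

```r
library(tibble)

# Hypothetical toy data standing in for the Sales subset
toy <- tibble(
  department  = c("Sales", "Sales", "Sales"),
  salary_2018 = c(50000, 60000, 250000)
)

# Empty column index: the assignment targets every column of the matching
# rows, including the character column `department`.
result <- tryCatch(
  {
    toy[toy$salary_2018 > 200000, ] <- mean(toy$salary_2018)
    "assignment succeeded"
  },
  error = function(e) "assignment failed"
)

print(result)
```

The failure is reported for `department` because that is the first column whose type cannot accept a double, even though the row filter was built from `salary_2018`.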

In your case, try:

df2[df2$salary_2018 > 200000, "salary_2018"] <- mean(df2$salary_2018, na.rm = TRUE)
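
To check that this does what is intended, here is a small self-contained sketch on made-up data (the `toy` tibble, `col_mean` and `is_outlier` are illustrative names, not part of the original code). It also guards against missing salaries, on the assumption that the real column might contain `NA`, since an `NA` in a row index used for assignment can itself raise an error:

```r
library(tibble)

# Hypothetical toy data standing in for the Sales subset
toy <- tibble(
  department  = c("Sales", "Sales", "Sales", "Sales"),
  salary_2018 = c(50000, 60000, 70000, 250000)
)

# Mean of the whole column, outlier included, as in the line above
col_mean <- mean(toy$salary_2018, na.rm = TRUE)

# TRUE only for non-missing salaries above the threshold
is_outlier <- !is.na(toy$salary_2018) & toy$salary_2018 > 200000

# Naming the column restricts the assignment to salary_2018
toy[is_outlier, "salary_2018"] <- col_mean

print(toy)
```

Only the fourth salary changes (to the column mean, 107500 for these values) and `department` is left untouched. Note that the mean here still includes the outlier itself. Whether that is what you want depends on the exercise.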