Issues when replacing outliers with mean in R
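I have an HR data frame that contains information about the employees of an organization, such as salary, department, ID, and so on.
What I want to do is replace the outliers (USD > 200000) in the "salary_2018" column for the "Sales" department with the mean of that same column.
This is for a professional course I am taking. I was given the data frame and the code, which are: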
library(readxl)
df<-read_excel("C:\Media Mean Mode.xlsx")
df1<-df[df$department=="Sales",]
df2 = df1
df2[df2$salary_2018<200000,]<-mean(df2$salary_2018)
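In the video I am following, the instructor uses exactly the same data frame with the same code, and it works. However, when I try exactly the same thing, I get the following error: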
Errore: Assigned data `mean(df2$salary_2018)` must be compatible with existing data.
i Error occurred for column `department`.
x Can't convert <double> to <character>.
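I would understand the error if I were trying to replace values in the "department" column, since its data type is "character".
But given that I am working on "salary_2018", which is "double", why does the error refer to "department"?
Do you know why this happens?
Thanks!
Edit: Following Peter's suggestion, I have added the structure of the data frame below.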
> dput(head(df, 5))
structure(list(age = c(41, 49, 37, 33, 27), department = c("Sales",
"Research & Development", "Research & Development", "Research & Development",
"Research & Development"), employee_number = c(1, 2, 4, 5, 7),
gender = c("Female", "Male", "Male", "Female", "Male"), job_level = c(2,
2, 1, 1, 1), marital_status = c("Single", "Married", "Single",
"Married", "Married"), over_time = c("Yes", "No", "Yes",
"Yes", "No"), performance_rating = c(3, 4, 3, 3, 3), totalW_working_years = c(8,
10, 7, 8, 6), training_times_last_year = c(0, 3, 3, 3, 3),
years_since_last_promotion = c(0, 1, 0, 3, 2), years_with_curr_manager = c(5,
7, 0, 0, 2), monthly_income = c(5993, 5130, 2090, 2909, 3468
), salary_2017 = c(71916, 61560, 25080, 34908, 41616), salary_2018 = c(79826.76,
75718.8, 28842, 38747.88, 46609.92), year_of_joining = c(2012,
2008, 2018, 2010, 2016), last_role_change = c(2014, 2011,
2018, 2011, 2016), percent_hike = c(11, 23, 15, 11, 12)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
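Even without the actual data, it is clear that your code tries to replace every column with the mean salary, in the rows where the salary is below 200k (shouldn't that be above?). This happens because you did not specify a column after the comma: an empty space there means all columns. Note the difference in this code: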
# all columns
mtcars[1:4, ]
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
# columns one, two and three
mtcars[1:4, 1:3]
#>                 mpg cyl disp
#> Mazda RX4      21.0   6  160
#> Mazda RX4 Wag  21.0   6  160
#> Datsun 710     22.8   4  108
#> Hornet 4 Drive 21.4   6  258
In your case, try:
df2[df2$salary_2018 > 200000, "salary_2018"] <- mean(df2$salary_2018, na.rm = TRUE)
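To check the fix without the original spreadsheet, here is a minimal sketch on a made-up data frame. The name `toy` and its values are hypothetical and are not taken from the HR data; the sketch uses a base `data.frame` rather than a tibble so it runs without extra packages. It names the column after the comma, so only `salary_2018` is modified.

```r
# Hypothetical toy data, not the original HR data set
toy <- data.frame(
  department  = c("Sales", "Sales", "Sales", "Sales"),
  salary_2018 = c(50000, 60000, 70000, 300000),
  stringsAsFactors = FALSE
)

# Compute the column mean once, before anything is modified
salary_mean <- mean(toy$salary_2018, na.rm = TRUE)

# Row positions of the outliers; which() drops rows where the comparison is NA
outlier_rows <- which(toy$salary_2018 > 200000)

# Name the column after the comma so only salary_2018 is touched
toy[outlier_rows, "salary_2018"] <- salary_mean

toy
```

Note that the mean here still includes the outlier itself, just as in the one-liner above. If the replacement should be based on the non-outlier rows only, the mean would have to be computed on that subset instead.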