Categorizing a variable and turning it into a dummy in R

Hello community :)

I am currently working with the ESGscore of several companies. The ESGscore ranges from 0 to 100.

I want to split the ESGscore into 4 categories:

0 - 25 --> poor --> 4
>25 - 50 --> medium --> 3
>50 - 75 --> good --> 2
75 - 100 --> excellent --> 1

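The problem with dummy.code is that it rearranges the ESGscore categories. For example, the ESGscore of AIR PRODUCTS & CHEMICALS INC is always 'excellent', but the output shows it as only medium.

This is what the code looks like: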

Datensatz_final_so$ESG.Kategorien <- ifelse(Datensatz_final_so$ESGscore <= 25, "4",
                                            ifelse(Datensatz_final_so$ESGscore > 25 & Datensatz_final_so$ESGscore <= 50, "3",
                                                   ifelse(Datensatz_final_so$ESGscore > 50 & Datensatz_final_so$ESGscore <= 75, "2",
                                                          ifelse(Datensatz_final_so$ESGscore > 75, "1", 0))))
# Create ESGscore dummy #
Dummy.ESG <- dummy.code(Datensatz_final_so$ESG.Kategorien)

colnames(Dummy.ESG) <- c("poor", "medium", "good", "excellent")

# Connect data and dummy #
Datensatz_final <- cbind(Datensatz_final, Dummy.ESG)

Do you know how to solve this?

One option is to rearrange the colnames to

colnames(Dummy.ESG) <- c("good", "excellent", "poor", "medium")

But that creates a new problem: R then picks medium as the reference category in the analysis.

Thanks in advance! :)

Sample data:

    structure(list(Company = c("AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC", "HESS CORP", "HESS CORP", "HESS CORP", 
"HESS CORP", "HESS CORP", "HESS CORP", "HESS CORP", "APACHE CORP", 
"APACHE CORP", "APACHE CORP", "APACHE CORP", "APACHE CORP", "APACHE CORP", 
"APACHE CORP", "AVERY DENNISON CORP", "AVERY DENNISON CORP", 
"AVERY DENNISON CORP", "AVERY DENNISON CORP", "AVERY DENNISON CORP", 
"AVERY DENNISON CORP", "AVERY DENNISON CORP", "BALL CORP", "BALL CORP", 
"BALL CORP", "BALL CORP", "BALL CORP", "BALL CORP", "BALL CORP", 
"CHEVRON CORP", "CHEVRON CORP", "CHEVRON CORP", "CHEVRON CORP", 
"CHEVRON CORP", "CHEVRON CORP", "CHEVRON CORP", "ECOLAB INC", 
"ECOLAB INC", "ECOLAB INC", "ECOLAB INC", "ECOLAB INC", "ECOLAB INC", 
"ECOLAB INC", "EXXON MOBIL CORP", "EXXON MOBIL CORP", "EXXON MOBIL CORP", 
"EXXON MOBIL CORP", "EXXON MOBIL CORP", "EXXON MOBIL CORP", "EXXON MOBIL CORP", 
"FMC CORP", "FMC CORP", "FMC CORP", "FMC CORP", "FMC CORP", "FMC CORP", 
"FMC CORP", "HALLIBURTON CO", "HALLIBURTON CO", "HALLIBURTON CO", 
"HALLIBURTON CO", "HALLIBURTON CO", "HALLIBURTON CO", "HALLIBURTON CO", 
"HELMERICH & PAYNE", "HELMERICH & PAYNE", "HELMERICH & PAYNE", 
"HELMERICH & PAYNE", "HELMERICH & PAYNE", "HELMERICH & PAYNE", 
"HELMERICH & PAYNE"), Year = c(2011, 2012, 2013, 2014, 2015, 
2016, 2017, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2011, 2012, 
2013, 2014, 2015, 2016, 2017, 2011, 2012, 2013, 2014, 2015, 2016, 
2017, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2011, 2012, 2013, 
2014, 2015, 2016, 2017, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 
2011, 2012, 2013, 2014, 2015, 2016, 2017, 2011, 2012, 2013, 2014, 
2015, 2016, 2017, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2011, 
2012, 2013, 2014, 2015, 2016, 2017), gvkey = c(1209, 1209, 1209, 
1209, 1209, 1209, 1209, 1380, 1380, 1380, 1380, 1380, 1380, 1380, 
1678, 1678, 1678, 1678, 1678, 1678, 1678, 1913, 1913, 1913, 1913, 
1913, 1913, 1913, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 2991, 
2991, 2991, 2991, 2991, 2991, 2991, 4213, 4213, 4213, 4213, 4213, 
4213, 4213, 4503, 4503, 4503, 4503, 4503, 4503, 4503, 4510, 4510, 
4510, 4510, 4510, 4510, 4510, 5439, 5439, 5439, 5439, 5439, 5439, 
5439, 5581, 5581, 5581, 5581, 5581, 5581, 5581), ESGscore = c(84.2750015258789, 
81.9225006103516, 77.4024963378906, 80.1125030517578, 78.6449966430664, 
76.3775024414062, 79.2699966430664, 69.4899978637695, 65.8300018310547, 
64.4300003051758, 74.3000030517578, 75.7600021362305, 71.4599990844727, 
74.6900024414062, 55.8300018310547, 56.0900001525879, 57.5, 60.75, 
60.8800010681152, 67.379997253418, 71.9899978637695, 82.9000015258789, 
77.3899993896484, 76.9300003051758, 78.7399978637695, 76.2283325195312, 
74.2125015258789, 68.3600006103516, 64.4100036621094, 65.6600036621094, 
63.75, 67.7300033569336, 67.5699996948242, 74.4300003051758, 
68.5699996948242, 86.5100021362305, 84.3099975585938, 82.6600036621094, 
82.3399963378906, 88.4100036621094, 90.0800018310547, 92.25, 
74.6999969482422, 72.3600006103516, 68.3899993896484, 67.9300003051758, 
65.629997253418, 74.9000015258789, 74.8600006103516, 81.6999969482422, 
79.370002746582, 79.0899963378906, 75.25, 81.9499969482422, 81.0199966430664, 
88.3399963378906, 59.8199996948242, 55.6500015258789, 52.2999992370605, 
51.8499984741211, 56.9199981689453, 66.620002746582, 65.3300018310547, 
85.9800033569336, 83.9499969482422, 85.1100006103516, 67.4300003051758, 
76.4400024414062, 69.9199981689453, 78.4599990844727, 19.0599994659424, 
17.5200004577637, 18.1200008392334, 23.5025005340576, 35.5349998474121, 
36.7350006103516, 41.1725006103516)), row.names = c(NA, -77L), class = c("tbl_df", 
"tbl", "data.frame"))

Let's take your data as df:

df<- structure(
  list(
    Company = c(
      "AIR PRODUCTS & CHEMICALS INC",
      ...
      ...
  ),
  row.names = c(NA,-77L),
  class = c("tbl_df",
            "tbl", "data.frame")
)

Let's do the categorization by building a small lookup data frame and then applying some dplyr:

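ESG <- c("poor", "medium", "good", "excellent")
# Lookup table: bin index 1..4 -> label
da <- data.frame(ESGColumn = 1:4, FlatESG = ESG)

# ceiling() puts 25, 50 and 75 in the lower bin, as in the question;
# pmax() keeps a score of exactly 0 in bin 1
df <- df |>
  dplyr::mutate(ESGColumn = pmax(ceiling(ESGscore / 25), 1)) |>
  dplyr::left_join(da, by = "ESGColumn") |>
  dplyr::select(-"ESGColumn")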

head(df)

# A tibble: 6 × 5
  Company                       Year gvkey ESGscore FlatESG  
  <chr>                        <dbl> <dbl>    <dbl> <chr>    
1 AIR PRODUCTS & CHEMICALS INC  2011  1209     84.3 excellent
2 AIR PRODUCTS & CHEMICALS INC  2012  1209     81.9 excellent
3 AIR PRODUCTS & CHEMICALS INC  2013  1209     77.4 excellent
4 AIR PRODUCTS & CHEMICALS INC  2014  1209     80.1 excellent
5 AIR PRODUCTS & CHEMICALS INC  2015  1209     78.6 excellent
6 AIR PRODUCTS & CHEMICALS INC  2016  1209     76.4 excellent

Grzegorz

Your sample data does not contain a variable named ESG.Kategorien, but it does contain ESGscore. The following should give you what you want:
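
A note on the FlatESG column from the answer above: it is a character column, so a modelling function such as lm() would convert it to a factor with alphabetically sorted levels, which makes excellent the reference. The following is only a minimal sketch, using a small made-up vector of labels (the names flat, esg and esg_ref are hypothetical), of how one might fix the level order and choose the reference category explicitly:

```r
# Hypothetical toy labels standing in for the FlatESG column
flat <- c("poor", "medium", "good", "excellent", "excellent")
labels <- c("poor", "medium", "good", "excellent")

# Store as a factor with levels in the intended order (poor first)
esg <- factor(flat, levels = labels)
levels(esg)

# Move the desired reference category to the first position
esg_ref <- relevel(esg, ref = "excellent")
levels(esg_ref)
```

With default treatment contrasts, R uses the first level as the reference, so relevel() avoids having to rename or reorder dummy columns by hand.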

Datensatz_final_so$Dummy <- cut(Datensatz_final_so$ESGscore, breaks=c(0, 25, 50, 75, 100), labels=c("poor", "medium", "good", "excellent"))
table(Datensatz_final_so$Dummy)
# 
#      poor    medium      good excellent 
#         4         3        38        32 
levels(Datensatz_final_so$Dummy)
# [1] "poor"      "medium"    "good"      "excellent" 

Note that your original categorization lists 75 as both good and excellent.
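
If separate 0/1 columns are still needed, one possible approach, sketched here in base R with a made-up vector of scores, is to build them from the factor with model.matrix(), which keeps the columns in the order of the factor levels. The include.lowest = TRUE argument is an addition made on the assumption that a score of exactly 0 should count as poor; with the default intervals, cut() returns NA for a value equal to the lowest break.

```r
# Hypothetical toy scores, not the asker's data
scores <- c(0, 25, 25.1, 50, 75, 75.1, 100)

# Right-closed intervals: [0,25], (25,50], (50,75], (75,100]
esg <- cut(scores,
           breaks = c(0, 25, 50, 75, 100),
           labels = c("poor", "medium", "good", "excellent"),
           include.lowest = TRUE)

# One 0/1 column per level; "- 1" drops the intercept so no level is omitted
dummies <- model.matrix(~ esg - 1, data = data.frame(esg = esg))
colnames(dummies) <- levels(esg)
dummies
```

For a regression it is usually enough to pass the factor itself, since R creates the dummy columns internally and treats the first level as the reference.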