R GSRUB function

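I am working with a public dataset in which the variable Classifications stores codes describing the license types associated with a LicenseNo. Any license can have between 1 and 19 different concurrent license types, each associated with a distinct LicenseNo. That function seems to be the right strategy for splitting Classifications into [1:19] new columns, Classification1:Classification19. I also need to convert the codes into descriptions. I created a table to support this post, because from what I read on the site it can be brought in as an rda file. I am not sure where to begin.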

LicenseNo <- c("1000002","1000003","1000012","1000015","1000026")
Classifications <- c("C57","C-6","B","C60| C51", "HAZ| C36| C10| A| B| C57| C-2| C-8| C12| C21| C27| C29| C35| C42| C45| C39| C50| C51| C31")
data <- data.frame(LicenseNo,Classifications)
View(data)

Descriptions <- c("Cabinet, Millwork and Finish Carpentry Contractor","General Building Contractor",
                  "Well Drilling Contractor", "Structural Steel Contractor","Welding Contractor",
                  "Hazardous Substance Removal Certification","Plumbing Contractor","Electrical Contractor",
                  "General Engineering Contractor", "Insulation and Acoustical Contractor")
Classifications <- c("C-6","B","C57","C51","C60","HAZ","C36","C10","A","C-2")
class_type <- data.frame(Descriptions,Classifications)
View(class_type)

Ultimately, I would like to create the following output, ... only 4 classifications are listed for observation 1000026, to keep it simple.

tidyverse

library(dplyr)
# library(tidyr) # unnest, pivot_*
out <- data %>%
  mutate(Classifications = strsplit(Classifications, "[|\\s]+")) %>%
  tidyr::unnest(Classifications) %>%
  mutate(Classifications = trimws(Classifications)) %>%
  left_join(class_type, by = "Classifications") %>%
  mutate(Classifications = coalesce(Descriptions, Classifications)) %>%
  select(-Descriptions)

out
# # A tibble: 24 x 2
#    LicenseNo Classifications                                  
#    <chr>     <chr>                                            
#  1 1000002   Well Drilling Contractor                         
#  2 1000003   Cabinet, Millwork and Finish Carpentry Contractor
#  3 1000012   General Building Contractor                      
#  4 1000015   Welding Contractor                               
#  5 1000015   Structural Steel Contractor                      
#  6 1000026   Hazardous Substance Removal Certification        
#  7 1000026   Plumbing Contractor                              
#  8 1000026   Electrical Contractor                            
#  9 1000026   General Engineering Contractor                   
# 10 1000026   General Building Contractor                      
# # ... with 14 more rows
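
As a side note, the split / unnest / trim steps can probably be collapsed into one call. This is only a sketch: it assumes tidyr's separate_rows() is available in your version, and the name `long` is just a placeholder of mine. The separator pattern treats a pipe plus any surrounding whitespace as the delimiter, so no separate trimws() should be needed.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question
data <- data.frame(
  LicenseNo = c("1000002", "1000003", "1000012", "1000015", "1000026"),
  Classifications = c(
    "C57", "C-6", "B", "C60| C51",
    "HAZ| C36| C10| A| B| C57| C-2| C-8| C12| C21| C27| C29| C35| C42| C45| C39| C50| C51| C31"
  ),
  stringsAsFactors = FALSE
)

# One row per (LicenseNo, code); the regex swallows whitespace around each "|"
long <- data %>%
  separate_rows(Classifications, sep = "\\s*\\|\\s*")

nrow(long)
```

On the sample data this should give the same 24 rows as the pipeline above, and the left_join() onto class_type works the same way afterwards.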

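Note: I coalesced the descriptions with the original classifications because some codes have no match in the lookup table. For example, without the coalesce we would see: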

# stored under a different name so the coalesced `out` above is not overwritten
out_raw <- data %>%
  mutate(Classifications = strsplit(Classifications, "[|\\s]+")) %>%
  tidyr::unnest(Classifications) %>%
  mutate(Classifications = trimws(Classifications)) %>%
  left_join(class_type, by = "Classifications")
print(out_raw, n = 99)
# # A tibble: 24 x 3
#    LicenseNo Classifications Descriptions                                     
#    <chr>     <chr>           <chr>                                            
#  1 1000002   C57             Well Drilling Contractor                         
#  2 1000003   C-6             Cabinet, Millwork and Finish Carpentry Contractor
#  3 1000012   B               General Building Contractor                      
#  4 1000015   C60             Welding Contractor                               
#  5 1000015   C51             Structural Steel Contractor                      
#  6 1000026   HAZ             Hazardous Substance Removal Certification        
#  7 1000026   C36             Plumbing Contractor                              
#  8 1000026   C10             Electrical Contractor                            
#  9 1000026   A               General Engineering Contractor                   
# 10 1000026   B               General Building Contractor                      
# 11 1000026   C57             Well Drilling Contractor                         
# 12 1000026   C-2             Insulation and Acoustical Contractor             
# 13 1000026   C-8             <NA>                                             
# 14 1000026   C12             <NA>                                             
# 15 1000026   C21             <NA>                                             
# 16 1000026   C27             <NA>                                             
# 17 1000026   C29             <NA>                                             
# 18 1000026   C35             <NA>                                             
# 19 1000026   C42             <NA>                                             
# 20 1000026   C45             <NA>                                             
# 21 1000026   C39             <NA>                                             
# 22 1000026   C50             <NA>                                             
# 23 1000026   C51             Structural Steel Contractor                      
# 24 1000026   C31             <NA>                                             
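
If you want a list of exactly which codes still need a description, an anti_join() against the lookup table is one way to get it. This is a minimal sketch on the sample data from the question; `missing_codes` is a name I made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same sample data and lookup table as in the question
data <- data.frame(
  LicenseNo = c("1000002", "1000003", "1000012", "1000015", "1000026"),
  Classifications = c(
    "C57", "C-6", "B", "C60| C51",
    "HAZ| C36| C10| A| B| C57| C-2| C-8| C12| C21| C27| C29| C35| C42| C45| C39| C50| C51| C31"
  ),
  stringsAsFactors = FALSE
)
class_type <- data.frame(
  Descriptions = c(
    "Cabinet, Millwork and Finish Carpentry Contractor", "General Building Contractor",
    "Well Drilling Contractor", "Structural Steel Contractor", "Welding Contractor",
    "Hazardous Substance Removal Certification", "Plumbing Contractor", "Electrical Contractor",
    "General Engineering Contractor", "Insulation and Acoustical Contractor"
  ),
  Classifications = c("C-6", "B", "C57", "C51", "C60", "HAZ", "C36", "C10", "A", "C-2"),
  stringsAsFactors = FALSE
)

# Distinct codes that appear in the data but have no row in the lookup table
missing_codes <- data %>%
  mutate(Classifications = strsplit(Classifications, "\\s*\\|\\s*")) %>%
  unnest(Classifications) %>%
  distinct(Classifications) %>%
  anti_join(class_type, by = "Classifications")

missing_codes
```

Those are the codes you would add to class_type if you want every row to carry a description.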

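My guess is that you would rather keep "something" than an NA, so when a description is missing I default to replacing the NA with the classification code. If your data does not have that concern, you can skip that step (and rename Descriptions to Classifications).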
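
For completeness, the "skip the coalesce and rename" variant might look like the sketch below. `desc_only` is a hypothetical name of mine. Note that with the sample lookup table from the question, the unmatched codes stay NA here, so this only makes sense once the table covers every code.

```r
library(dplyr)
library(tidyr)

# Same sample data and lookup table as in the question
data <- data.frame(
  LicenseNo = c("1000002", "1000003", "1000012", "1000015", "1000026"),
  Classifications = c(
    "C57", "C-6", "B", "C60| C51",
    "HAZ| C36| C10| A| B| C57| C-2| C-8| C12| C21| C27| C29| C35| C42| C45| C39| C50| C51| C31"
  ),
  stringsAsFactors = FALSE
)
class_type <- data.frame(
  Descriptions = c(
    "Cabinet, Millwork and Finish Carpentry Contractor", "General Building Contractor",
    "Well Drilling Contractor", "Structural Steel Contractor", "Welding Contractor",
    "Hazardous Substance Removal Certification", "Plumbing Contractor", "Electrical Contractor",
    "General Engineering Contractor", "Insulation and Acoustical Contractor"
  ),
  Classifications = c("C-6", "B", "C57", "C51", "C60", "HAZ", "C36", "C10", "A", "C-2"),
  stringsAsFactors = FALSE
)

# No coalesce: keep only the description and give it the Classifications name
desc_only <- data %>%
  mutate(Classifications = strsplit(Classifications, "\\s*\\|\\s*")) %>%
  unnest(Classifications) %>%
  left_join(class_type, by = "Classifications") %>%
  select(LicenseNo, Classifications = Descriptions)

desc_only
```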

Long format is good for a lot of things (especially ggplot2 and similar "tidy" operations), but if you want wide format, then

# `out` here is the coalesced long table from the first pipeline
out %>%
  group_by(LicenseNo) %>%
  mutate(rn = paste0("Classification", row_number())) %>%
  ungroup() %>%
  tidyr::pivot_wider(id_cols = LicenseNo, names_from = rn, values_from = Classifications)
# # A tibble: 5 x 20
#   LicenseNo Classification1 Classification2 Classification3 Classification4 Classification5 Classification6 Classification7
#   <chr>     <chr>           <chr>           <chr>           <chr>           <chr>           <chr>           <chr>          
# 1 1000002   Well Drilling ~ <NA>            <NA>            <NA>            <NA>            <NA>            <NA>           
# 2 1000003   Cabinet, Millw~ <NA>            <NA>            <NA>            <NA>            <NA>            <NA>           
# 3 1000012   General Buildi~ <NA>            <NA>            <NA>            <NA>            <NA>            <NA>           
# 4 1000015   Welding Contra~ Structural Ste~ <NA>            <NA>            <NA>            <NA>            <NA>           
# 5 1000026   Hazardous Subs~ Plumbing Contr~ Electrical Con~ General Engine~ General Buildi~ Well Drilling ~ Insulation and~
# # ... with 12 more variables: Classification8 <chr>, Classification9 <chr>, Classification10 <chr>, Classification11 <chr>,
# #   Classification12 <chr>, Classification13 <chr>, Classification14 <chr>, Classification15 <chr>, Classification16 <chr>,
# #   Classification17 <chr>, Classification18 <chr>, Classification19 <chr>