R GSRUB 函数
R GSRUB function
使用 public 数据集,其中变量分类存储描述与 LicenseNo 关联的许可证类型的代码。任何许可证都可以有 1 到 19 种不同的并发许可证类型,这些许可证类型与不同的 licenseNo 相关联。函数似乎是将分类拆分为 [1:19] 新列 Classification1:Classification19 的正确策略。不知道从哪里开始。我还需要将代码转换为描述,创建了一个 table 来支持这篇文章,因为我从网站上读到的内容认为它可以作为 rda 文件引入。不知道从哪里开始。
LicenseNo <- c("1000002","1000003","1000012","1000015","1000026")
Classifications <- c("C57","C-6","B","C60| C51", "HAZ| C36| C10| A| B| C57| C-2| C-8| C12| C21| C27| C29| C35| C42| C45| C39| C50| C51| C31")
data <- data.frame(LicenseNo,Classifications)
View(data)
Descriptions <- c("Cabinet, Millwork and Finish Carpentry Contractor","General Building Contractor",
"Well Drilling Contractor", "Structural Steel Contractor","Welding Contractor",
"Hazardous Substance Removal Certification","Plumbing Contractor","Electrical Contractor",
"General Engineering Contractor", "Insulation and Acoustical Contractor")
Classifications <- c("C-6","B","C57","C51","C60","HAZ","C36","C10","A","C-2")
class_type <- data.frame(Descriptions,Classifications)
View(class_type)
最终,希望创建以下输出,...只列出了 4 个分类以供观察 1000026 以简化。
整洁宇宙
library(dplyr)
# library(tidyr) # unnest, pivot_*
out <- data %>%
mutate(Classifications = strsplit(Classifications, "[|\s]+")) %>%
tidyr::unnest(Classifications) %>%
mutate(Classifications = trimws(Classifications)) %>%
left_join(class_type, by = "Classifications") %>%
mutate(Classifications = coalesce(Descriptions, Classifications)) %>%
select(-Descriptions)
out
# # A tibble: 24 x 2
# LicenseNo Classifications
# <chr> <chr>
# 1 1000002 Well Drilling Contractor
# 2 1000003 Cabinet, Millwork and Finish Carpentry Contractor
# 3 1000012 General Building Contractor
# 4 1000015 Welding Contractor
# 5 1000015 Structural Steel Contractor
# 6 1000026 Hazardous Substance Removal Certification
# 7 1000026 Plumbing Contractor
# 8 1000026 Electrical Contractor
# 9 1000026 General Engineering Contractor
# 10 1000026 General Building Contractor
# # ... with 14 more rows
注意: 我coalesce
d 描述和原始分类因为缺少组件。例如,没有 coalesce
我们会看到:
out <- data %>%
mutate(Classifications = strsplit(Classifications, "[|\s]+")) %>%
tidyr::unnest(Classifications) %>%
mutate(Classifications = trimws(Classifications)) %>%
left_join(class_type, by = "Classifications")
print(out,n=99)
# # A tibble: 24 x 3
# LicenseNo Classifications Descriptions
# <chr> <chr> <chr>
# 1 1000002 C57 Well Drilling Contractor
# 2 1000003 C-6 Cabinet, Millwork and Finish Carpentry Contractor
# 3 1000012 B General Building Contractor
# 4 1000015 C60 Welding Contractor
# 5 1000015 C51 Structural Steel Contractor
# 6 1000026 HAZ Hazardous Substance Removal Certification
# 7 1000026 C36 Plumbing Contractor
# 8 1000026 C10 Electrical Contractor
# 9 1000026 A General Engineering Contractor
# 10 1000026 B General Building Contractor
# 11 1000026 C57 Well Drilling Contractor
# 12 1000026 C-2 Insulation and Acoustical Contractor
# 13 1000026 C-8 <NA>
# 14 1000026 C12 <NA>
# 15 1000026 C21 <NA>
# 16 1000026 C27 <NA>
# 17 1000026 C29 <NA>
# 18 1000026 C35 <NA>
# 19 1000026 C42 <NA>
# 20 1000026 C45 <NA>
# 21 1000026 C39 <NA>
# 22 1000026 C50 <NA>
# 23 1000026 C51 Structural Steel Contractor
# 24 1000026 C31 <NA>
我的猜测是您更愿意保留“某物”副 NA
,因此当描述缺失时,我默认用分类替换 NA
。如果您的数据没有这样的顾虑,那么您可以跳过该步骤(并将描述重命名为分类)。
长格式适用于很多事情(尤其是 ggplot2
和类似的“整洁”操作),但如果您想要宽格式,那么
out %>%
group_by(LicenseNo) %>%
mutate(rn = paste0("Classification", row_number())) %>%
ungroup() %>%
tidyr::pivot_wider(LicenseNo, names_from = rn, values_from = Classifications)
# # A tibble: 5 x 20
# LicenseNo Classification1 Classification2 Classification3 Classification4 Classification5 Classification6 Classification7
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1000002 Well Drilling ~ <NA> <NA> <NA> <NA> <NA> <NA>
# 2 1000003 Cabinet, Millw~ <NA> <NA> <NA> <NA> <NA> <NA>
# 3 1000012 General Buildi~ <NA> <NA> <NA> <NA> <NA> <NA>
# 4 1000015 Welding Contra~ Structural Ste~ <NA> <NA> <NA> <NA> <NA>
# 5 1000026 Hazardous Subs~ Plumbing Contr~ Electrical Con~ General Engine~ General Buildi~ Well Drilling ~ Insulation and~
# # ... with 12 more variables: Classification8 <chr>, Classification9 <chr>, Classification10 <chr>, Classification11 <chr>,
# # Classification12 <chr>, Classification13 <chr>, Classification14 <chr>, Classification15 <chr>, Classification16 <chr>,
# # Classification17 <chr>, Classification18 <chr>, Classification19 <chr>
使用 public 数据集,其中变量分类存储描述与 LicenseNo 关联的许可证类型的代码。任何许可证都可以有 1 到 19 种不同的并发许可证类型,这些许可证类型与不同的 licenseNo 相关联。函数似乎是将分类拆分为 [1:19] 新列 Classification1:Classification19 的正确策略。不知道从哪里开始。我还需要将代码转换为描述,创建了一个 table 来支持这篇文章,因为我从网站上读到的内容认为它可以作为 rda 文件引入。不知道从哪里开始。
LicenseNo <- c("1000002","1000003","1000012","1000015","1000026")
Classifications <- c("C57","C-6","B","C60| C51", "HAZ| C36| C10| A| B| C57| C-2| C-8| C12| C21| C27| C29| C35| C42| C45| C39| C50| C51| C31")
data <- data.frame(LicenseNo,Classifications)
View(data)
Descriptions <- c("Cabinet, Millwork and Finish Carpentry Contractor","General Building Contractor",
"Well Drilling Contractor", "Structural Steel Contractor","Welding Contractor",
"Hazardous Substance Removal Certification","Plumbing Contractor","Electrical Contractor",
"General Engineering Contractor", "Insulation and Acoustical Contractor")
Classifications <- c("C-6","B","C57","C51","C60","HAZ","C36","C10","A","C-2")
class_type <- data.frame(Descriptions,Classifications)
View(class_type)
最终,希望创建以下输出,...只列出了 4 个分类以供观察 1000026 以简化。
整洁宇宙
library(dplyr)
# library(tidyr) # unnest, pivot_*
out <- data %>%
mutate(Classifications = strsplit(Classifications, "[|\s]+")) %>%
tidyr::unnest(Classifications) %>%
mutate(Classifications = trimws(Classifications)) %>%
left_join(class_type, by = "Classifications") %>%
mutate(Classifications = coalesce(Descriptions, Classifications)) %>%
select(-Descriptions)
out
# # A tibble: 24 x 2
# LicenseNo Classifications
# <chr> <chr>
# 1 1000002 Well Drilling Contractor
# 2 1000003 Cabinet, Millwork and Finish Carpentry Contractor
# 3 1000012 General Building Contractor
# 4 1000015 Welding Contractor
# 5 1000015 Structural Steel Contractor
# 6 1000026 Hazardous Substance Removal Certification
# 7 1000026 Plumbing Contractor
# 8 1000026 Electrical Contractor
# 9 1000026 General Engineering Contractor
# 10 1000026 General Building Contractor
# # ... with 14 more rows
注意: 我coalesce
d 描述和原始分类因为缺少组件。例如,没有 coalesce
我们会看到:
out <- data %>%
mutate(Classifications = strsplit(Classifications, "[|\s]+")) %>%
tidyr::unnest(Classifications) %>%
mutate(Classifications = trimws(Classifications)) %>%
left_join(class_type, by = "Classifications")
print(out,n=99)
# # A tibble: 24 x 3
# LicenseNo Classifications Descriptions
# <chr> <chr> <chr>
# 1 1000002 C57 Well Drilling Contractor
# 2 1000003 C-6 Cabinet, Millwork and Finish Carpentry Contractor
# 3 1000012 B General Building Contractor
# 4 1000015 C60 Welding Contractor
# 5 1000015 C51 Structural Steel Contractor
# 6 1000026 HAZ Hazardous Substance Removal Certification
# 7 1000026 C36 Plumbing Contractor
# 8 1000026 C10 Electrical Contractor
# 9 1000026 A General Engineering Contractor
# 10 1000026 B General Building Contractor
# 11 1000026 C57 Well Drilling Contractor
# 12 1000026 C-2 Insulation and Acoustical Contractor
# 13 1000026 C-8 <NA>
# 14 1000026 C12 <NA>
# 15 1000026 C21 <NA>
# 16 1000026 C27 <NA>
# 17 1000026 C29 <NA>
# 18 1000026 C35 <NA>
# 19 1000026 C42 <NA>
# 20 1000026 C45 <NA>
# 21 1000026 C39 <NA>
# 22 1000026 C50 <NA>
# 23 1000026 C51 Structural Steel Contractor
# 24 1000026 C31 <NA>
我的猜测是您更愿意保留“某物”副 NA
,因此当描述缺失时,我默认用分类替换 NA
。如果您的数据没有这样的顾虑,那么您可以跳过该步骤(并将描述重命名为分类)。
长格式适用于很多事情(尤其是 ggplot2
和类似的“整洁”操作),但如果您想要宽格式,那么
out %>%
group_by(LicenseNo) %>%
mutate(rn = paste0("Classification", row_number())) %>%
ungroup() %>%
tidyr::pivot_wider(LicenseNo, names_from = rn, values_from = Classifications)
# # A tibble: 5 x 20
# LicenseNo Classification1 Classification2 Classification3 Classification4 Classification5 Classification6 Classification7
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1000002 Well Drilling ~ <NA> <NA> <NA> <NA> <NA> <NA>
# 2 1000003 Cabinet, Millw~ <NA> <NA> <NA> <NA> <NA> <NA>
# 3 1000012 General Buildi~ <NA> <NA> <NA> <NA> <NA> <NA>
# 4 1000015 Welding Contra~ Structural Ste~ <NA> <NA> <NA> <NA> <NA>
# 5 1000026 Hazardous Subs~ Plumbing Contr~ Electrical Con~ General Engine~ General Buildi~ Well Drilling ~ Insulation and~
# # ... with 12 more variables: Classification8 <chr>, Classification9 <chr>, Classification10 <chr>, Classification11 <chr>,
# # Classification12 <chr>, Classification13 <chr>, Classification14 <chr>, Classification15 <chr>, Classification16 <chr>,
# # Classification17 <chr>, Classification18 <chr>, Classification19 <chr>