Issue while splitting a column of a data frame into different columns
This is a sample of the data frame I am working with.
structure(list(Company.Name = c("Ample Softech System", "Ziff Davis LLC",
"IIM Kozhikkode", "Perennial", "Irupar Sociedad Cooperativa",
"md", ""), Job.Title = c("Data Analyst", "Data Analyst", "Data Analyst",
"Data Analyst", "Data Analyst", "Data Analyst", "Data Analyst"
), Salaries.Reported = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), Location = c("Pune",
"Pune", "Pune", "Pune", "Pune", "Pune", "Pune"), Salary = c("₹35,563/mo",
"₹5,21,474/yr", "₹7,64,702/yr", "₹16,123/mo", "₹6,04,401/yr",
"AFN 1,56,179/yr", "₹23,500/mo")), row.names = 2274:2280, class = "data.frame")
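The Salary column contains values in the pattern (Currency_symbol + Figure + periodicity), e.g. ₹35,563/mo
I have been trying to split this pattern into separate columns, using the following code.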
smpl = separate(sample, col = Salary, into = c( "Currency_symbol", "Salary_copy"), sep = 1, remove = TRUE, convert = TRUE) #separates currency_symbol into separate column
smpl
smpl2 = separate(smpl, col = Salary_copy, into = c('Salary_copy', 'Periodicity'), sep = -3, remove = TRUE, convert = TRUE) # separates periodicity to separate column
smpl2
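The problem I am facing is that one row has a three-character currency symbol (AFN), while all the others have a single character.
So the lines of code above fail to split the pattern into the right columns for that row.
If I change the index passed to the sep argument, all the other rows are affected. How can I solve this?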
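To make the failure concrete, here is a minimal base R sketch. The two-element vector `x` is a hypothetical stand-in for the Salary column, and `substr` is used to mimic what a positional `sep = 1` split does; it is an illustration of the problem, not the original code.

```r
# Two sample values: one with a 1-character symbol, one with a 3-letter code.
# "\u20b9" is the rupee sign, written as an escape to avoid encoding issues.
x <- c("\u20b935,563/mo", "AFN 1,56,179/yr")

# Positional split after the first character (what sep = 1 amounts to)
currency_fixed <- substr(x, 1, 1)
rest_fixed     <- substr(x, 2, nchar(x))

# Pattern-based split: drop everything from the first digit onwards,
# then trim the trailing space left behind by "AFN "
currency_regex <- trimws(sub("[0-9].*$", "", x))

print(data.frame(x, currency_fixed, rest_fixed, currency_regex))
```

The positional version cuts "AFN" after the "A", whereas the pattern-based version adapts to the length of the currency part. This is why a regular expression is a better fit here than a fixed index.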
A possible solution:
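library(tidyverse)
df %>%
  # split either between a single leading non-digit and the first digit,
  # or at a whitespace that follows a non-digit (the "AFN " case)
  separate(Salary, sep = "((?<=^\\D)(?=\\d))|((?<=\\D)\\s)", into = str_c("col", 1:2)) %>%
  # then split the amount from the periodicity at the slash
  separate(col2, sep = "/", into = str_c("col", 2:3))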
#> Company.Name Job.Title Salaries.Reported Location col1
#> 2274 Ample Softech System Data Analyst 1 Pune ₹
#> 2275 Ziff Davis LLC Data Analyst 1 Pune ₹
#> 2276 IIM Kozhikkode Data Analyst 1 Pune ₹
#> 2277 Perennial Data Analyst 1 Pune ₹
#> 2278 Irupar Sociedad Cooperativa Data Analyst 1 Pune ₹
#> 2279 md Data Analyst 1 Pune AFN
#> 2280 Data Analyst 1 Pune ₹
#> col2 col3
#> 2274 35,563 mo
#> 2275 5,21,474 yr
#> 2276 7,64,702 yr
#> 2277 16,123 mo
#> 2278 6,04,401 yr
#> 2279 1,56,179 yr
#> 2280 23,500 mo
Another solution, using extract and a simpler regular expression. An extra step trims whitespace and removes the commas from the salary amount.
df2 <- df %>%
  # group 1: leading non-digits, group 2: digits and commas, group 3: text after "/"
  extract(Salary, c('currency', 'amount', 'period'), '^(\\D+)([0-9,]+)/(.*)') %>%
  mutate(
    currency = gsub(' ', '', currency),
    amount = as.numeric(gsub(',', '', amount))
  )
Company.Name Job.Title Salaries.Reported Location currency amount period
2274 Ample Softech System Data Analyst 1 Pune ₹ 35563 mo
2275 Ziff Davis LLC Data Analyst 1 Pune ₹ 521474 yr
2276 IIM Kozhikkode Data Analyst 1 Pune ₹ 764702 yr
2277 Perennial Data Analyst 1 Pune ₹ 16123 mo
2278 Irupar Sociedad Cooperativa Data Analyst 1 Pune ₹ 604401 yr
2279 md Data Analyst 1 Pune AFN 156179 yr
2280 Data Analyst 1 Pune ₹ 23500 mo
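If loading the tidyverse is not an option, the same capture-group idea can be sketched in base R with `regexec` and `regmatches`. The `salary` vector below is a hypothetical stand-in for the Salary column, and the names `pattern`, `parts`, and `out` are illustrative only.

```r
# Stand-in for df$Salary; "\u20b9" is the rupee sign
salary <- c("\u20b935,563/mo", "\u20b95,21,474/yr", "AFN 1,56,179/yr")

# Group 1: leading non-digits (currency), group 2: digits and commas (amount),
# group 3: everything after the slash (period)
pattern <- "^([^0-9]+)([0-9,]+)/(.*)$"

# For each string: the full match followed by the three captured groups
m <- regmatches(salary, regexec(pattern, salary))

# One row per input string; column 1 holds the full match.
# Note: strings that do not match the pattern would be dropped by rbind.
parts <- do.call(rbind, m)

out <- data.frame(
  currency = trimws(parts[, 2]),                     # "AFN " -> "AFN"
  amount   = as.numeric(gsub(",", "", parts[, 3])),  # "5,21,474" -> 521474
  period   = parts[, 4],
  stringsAsFactors = FALSE
)
print(out)
```

With the real data, `salary` would be replaced by `df$Salary` and the resulting columns bound back onto the data frame, for example with `cbind`.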