How to wrangle data and mutate a column whose name contains a specific string?
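This is hard to describe in words, so I made a reprex.
The input, my attempt, and the expected output are below.
How should we process the data?
1. When we write a function and mutate as shown below, the column name string differs every time, which makes it ambiguous which column to refer to.
2. Once we have a unique column name, how do we bind the data frames together?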
library(tidyverse)
# Basically, "." means ",". So, better we remove . and PC and convert to Numeric
df1 <- tribble(
~`ABC sales 01.01.2019 - 01.02.2019`, ~code,
"1.019 PC", 2000, # Actually, it is 1019 (remove . and PC)
"100 PC", 2101,
"3.440 PC", 2002
)
df2 <- tribble(
~`ABC sales 01.03.2019 - 01.04.2019`, ~year,
"6.019 PC", 2019,
"20 PC", 2001,
"043.440 PC", 2002
)
df3 <- tribble(
~`ABC sales 01.05.2019 - 01.06.2019`, ~year,
"1.019 PC", 2000,
"701 PC", 2101,
"6.440 PC", 2002
)
# Input data
input_df = list(df1,df2,df3)
#### function to clean data
# str_replace is used twice because
# remove PC and dot
data_read = function(file){
df_ <- df %>% #glimpse()
# Select the column to remove PC, spaces and .
# Each time, column name differs so, `ABC sales 01.01.2019 - 01.02.2019` cannot be used
mutate_at(sales_dot = str_replace(select(contains('ABC')), "PC",""),
sales = str_replace(sales_dot, "\.",""), # name the new column so that rbind can be applied later
sales_dot = NULL, # delete the old column
vars(contains("ABC")) = NULL # delete the old column
)
df_
}
# attempt to resolve
# To clean the data from dots and PC
output_df1 <- map(input_df, data_read) # or lapply ?
# rbind
output = map(output_df1, rbind) # or lapply ?
expected_output <- tribble(
~sales, ~year,
"1019", 2000,
"100", 2101,
"3440", 2002,
"6019", 2019,
"20", 2001,
"043440", 2002,
"1019", 2000,
"701", 2101,
"6440", 2002
)
With purrr, dplyr, and stringr, you can do:
map_df(.x = input_df, ~ .x %>%
set_names(., c("sales", "year"))) %>%
mutate(sales = str_remove_all(sales, "[. PC]"))
  sales   year
  <chr>  <dbl>
1 1019    2000
2 100     2101
3 3440    2002
4 6019    2019
5 20      2001
6 043440  2002
7 1019    2000
8 701     2101
9 6440    2002
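The approach above renames columns purely by position. If you would rather pick the sales column by the string its name contains, as the question title asks, something along these lines might work. This is only a sketch: `clean_sales` is a hypothetical helper name, the input is a shortened stand-in for the question's `input_df`, and it assumes exactly one column per data frame contains "ABC" and that the second column is always the year (even when it is called `code`).

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

# Shortened stand-in for the question's input_df (same shape, fewer rows)
input_df <- list(
  tribble(
    ~`ABC sales 01.01.2019 - 01.02.2019`, ~code,
    "1.019 PC", 2000,
    "100 PC",   2101
  ),
  tribble(
    ~`ABC sales 01.03.2019 - 01.04.2019`, ~year,
    "6.019 PC",   2019,
    "043.440 PC", 2002
  )
)

# Hypothetical helper: give every data frame the same column names,
# then strip dots, spaces and the letters "P" and "C" from the sales values
clean_sales <- function(df) {
  df %>%
    # Assumption: exactly one column name contains "ABC";
    # the second column is renamed to "year" by position
    rename(sales = contains("ABC"), year = 2) %>%
    mutate(sales = str_remove_all(sales, "[. PC]"))
}

# Clean each data frame, then stack them row-wise
output <- input_df %>%
  map(clean_sales) %>%
  bind_rows()

print(output)
```

Because every cleaned data frame ends up with the same two column names, `bind_rows()` can stack them directly. `sales` stays a character column here, matching the expected output; wrapping it in `as.numeric()` would give numbers but drop the leading zero of values such as "043440".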