Identify and strip characters from columns

I have a large dataset in which I want to identify and remove letters and symbols so that only the numeric values remain.

For example, I want -£1125.91m to become -1125.91.

dataset
  Event                       var1       var2  
  <fct>                       <chr>      <chr> 
1 Labour Costs YoY            13.34m     0.026 
2 Unemployment Change (000's) .91b    -0.449
3 Unemployment Rate           -£1125.91m 0.89k 
4 Jobseekers Net Change       ¥1012.74b  9.56m

Currently I know how to remove a single character from a column, like this:

dataset$`var1` <- gsub("k", "", dataset$`var1`)

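Doing this manually would take a lot of work, because the dataset is very large. Is there a way to identify and remove all letters at once, including the currency symbols as well as m and b?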

To reproduce the dataset:

dataset <- structure(list(
  Event = structure(2:5,
    .Label = c("Event", "Labour Costs YoY", "Unemployment Change (000's)",
               "Unemployment Rate", "Jobseekers Net Change"),
    .Names = c("", "", "", ""), class = "factor"),
  var1 = c("13.34m", ".91b", "-£1125.91m", "¥1012.74b"),
  var2 = c(0.026, -0.449, "0.89k", "9.56m")),
  row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

To remove everything other than hyphens, digits, and dots, you can use

dataset$var1 <- gsub("[^-0-9.]", "", dataset$var1)

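The [^-0-9.] pattern is a negated character class: it matches any character other than those listed in the class, so gsub deletes everything that is not a hyphen, a digit, or a dot.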
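As a small illustration of how the negated class behaves, here is a sketch on a hypothetical vector x whose values mirror the formats in the question (x and cleaned are names chosen for this example only):

```r
# Hypothetical sample values in the same formats as var1 and var2.
x <- c("-£1125.91m", "¥1012.74b", "0.89k", "-0.449")

# Inside [...], a leading ^ negates the class. The hyphen comes right after
# the ^ so that it is a literal "-" rather than part of a range like 0-9.
cleaned <- gsub("[^-0-9.]", "", x)

print(cleaned)
```

Currency symbols and unit suffixes are removed, while the sign and the decimal point survive. The result is still a character vector.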

See the regex demo online.

See an online R demo:

dataset <- structure(list(Event = structure(2:5, .Label = c("Event", "Labour Costs YoY", 
    "Unemployment Change (000's)", "Unemployment Rate", "Jobseekers Net Change"), 
   .Names = c("", "", "", ""), class = "factor"), var1 = c("13.34m", ".91b", "-£1125.91m", "¥1012.74b"), var2 = c(0.026, -0.449, "0.89k", "9.56m")), row.names = c(NA, 
   -4L), class = c("tbl_df", "tbl", "data.frame"))
gsub("[^-0-9.]", "", dataset$var1)
##  => [1] "13.34"    ".91"      "-1125.91" "1012.74"
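Since the question asks about cleaning the whole dataset rather than one column, the same pattern could be applied to several columns in one step and the result converted to numeric. The following is only a sketch under a few assumptions: it uses a reduced, hypothetical two-row data frame, and cols and strip_non_numeric are illustrative names, not part of the original code.

```r
# Reduced, hypothetical version of the dataset from the question.
dataset <- data.frame(
  Event = c("Labour Costs YoY", "Unemployment Rate"),
  var1  = c("13.34m", "-£1125.91m"),
  var2  = c("0.026", "0.89k"),
  stringsAsFactors = FALSE
)

# Helper (illustrative name): strip everything except "-", digits and ".",
# then convert the remaining text to numeric.
strip_non_numeric <- function(x) as.numeric(gsub("[^-0-9.]", "", x))

# Columns to clean; Event is left untouched.
cols <- c("var1", "var2")
dataset[cols] <- lapply(dataset[cols], strip_non_numeric)

str(dataset)
```

Note that this discards the k, m, and b suffixes, as the question requests, so 0.89k and 0.89 become the same value. A value that is not a valid number after stripping (for example one with a hyphen in the middle) would become NA with a warning from as.numeric.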