Identify and strip characters from columns
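I have a large dataset in which I want to identify and remove characters and symbols so that only the numeric values remain.
For example, I want -£1125.91m
to become -1125.91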
dataset
Event var1 var2
<fct> <chr> <chr>
1 Labour Costs YoY 13.34m 0.026
2 Unemployment Change (000's) .91b -0.449
3 Unemployment Rate -£1125.91m 0.89k
4 Jobseekers Net Change ¥1012.74b 9.56m
At the moment I know how to remove a single character from a column, like this:
dataset$var1 <- gsub("k", "", dataset$var1)
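Doing this by hand would take a lot of work because the dataset is very large. Is there a way to identify and remove all such characters at once, including the currency symbols and the m and b suffixes?
To reproduce the dataset: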
dataset <- structure(list(Event = structure(2:5, .Label = c("Event", "Labour Costs YoY",
"Unemployment Change (000's)", "Unemployment Rate", "Jobseekers Net Change"),
.Names = c("", "", "", ""), class = "factor"), var1 = c("13.34m", ".91b", "-£1125.91m", "¥1012.74b"), var2 = c(0.026, -0.449, "0.89k", "9.56m")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
To remove everything other than hyphens, digits, and dots, you can use
dataset$var1 <- gsub("[^-0-9.]", "", dataset$var1)
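The [^-0-9.] pattern is a negated character class: it matches any character other than those listed inside the brackets, here anything that is not a hyphen, a digit, or a dot.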
dataset <- structure(list(Event = structure(2:5, .Label = c("Event", "Labour Costs YoY",
"Unemployment Change (000's)", "Unemployment Rate", "Jobseekers Net Change"),
.Names = c("", "", "", ""), class = "factor"), var1 = c("13.34m", ".91b", "-£1125.91m", "¥1012.74b"), var2 = c(0.026, -0.449, "0.89k", "9.56m")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
gsub("[^-0-9.]", "", dataset$var1)
## => [1] "13.34" ".91" "-1125.91" "1012.74"
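Since the question mentions a very large dataset with more than one affected column, the same pattern can be applied to several columns in one pass and the result converted to numeric. The following is only a sketch under a few assumptions: the column vector cols and the copy named cleaned are names introduced here for illustration, the data is rebuilt as a plain data.frame to keep the example self-contained, and the currency symbols are written as Unicode escapes to avoid encoding problems.

```r
# Rebuild the two character columns from the question
# ("\u00a3" is the pound sign, "\u00a5" is the yen sign).
dataset <- data.frame(
  var1 = c("13.34m", ".91b", "-\u00a31125.91m", "\u00a51012.74b"),
  var2 = c("0.026", "-0.449", "0.89k", "9.56m"),
  stringsAsFactors = FALSE
)

# Columns to clean (hypothetical selection; adjust to your data).
cols <- c("var1", "var2")

# Work on a copy so the raw strings stay available.
cleaned <- dataset

# Strip everything except hyphens, digits and dots, then convert to numeric.
cleaned[cols] <- lapply(cleaned[cols], function(x) {
  as.numeric(gsub("[^-0-9.]", "", x))
})

str(cleaned)
```

Note that stripping the k, m and b suffixes also discards the magnitude they stand for, so 13.34m and 13.34b both end up as 13.34. If the scale matters, it has to be captured before the suffix is removed.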