Replace multiple values using a reference table
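I am cleaning a database in which one of the fields is "country", but the country names in my database do not match the output I need.
I thought of using the str_replace function, but I have more than 50 countries to fix, so that is not the most efficient approach. I have already prepared a CSV file that pairs each original country input with the output I need.
Here is what I have so far: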
library(stringr)
library(dplyr)
library(tidyr)
library(readxl)
database1 <- read_excel("database.xlsx")
database1$country <- str_replace(database1$country, "USA", "United States")
database1$country <- str_replace(database1$country, "UK", "United Kingdom")
database1$country <- str_replace(database1$country, "Bolivia", "Bolivia,Plurinational State of")
write.csv(database1, "test.csv", row.names = FALSE, fileEncoding = 'UTF-8', na = "")
Note: the levels and labels passed to factor must be unique; they should not contain duplicates.
# database1 <- read_excel("database.xlsx") ## read database excel book
old_names <- c("USA", "UGA", "CHL") ## country abbreviations
new_names <- c("United States", "Uganda", "Chile") ## country full form
Base R
database1 <- within( database1, country <- factor( country, levels = old_names, labels = new_names ))
data.table
library('data.table')
setDT(database1)
database1[, country := factor(country, levels = old_names, labels = new_names)]
database1
# country
# 1: United States
# 2: Uganda
# 3: Chile
# 4: United States
# 5: Uganda
# 6: Chile
# 7: United States
# 8: Uganda
# 9: Chile
Data
database1 <- data.frame(country = c("USA", "UGA", "CHL", "USA", "UGA", "CHL", "USA", "UGA", "CHL"))
# country
# 1 USA
# 2 UGA
# 3 CHL
# 4 USA
# 5 UGA
# 6 CHL
# 7 USA
# 8 UGA
# 9 CHL
Edit:
Instead of two variables such as old_names and new_names, you can create a single named vector, countries.
countries <- c("USA", "UGA", "CHL")
names(countries) <- c("United States", "Uganda", "Chile")
within( database1, country <- factor( country, levels = countries, labels = names(countries) ))
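One thing to keep in mind with the factor approach: any value that is not listed in levels comes back as NA. The sketch below shows that behaviour and a base R alternative built on match() that leaves unmatched values untouched. The lookup table, its column names (old, new), and the sample value "France" are assumptions made up for illustration; in practice the table could come from the prepared CSV file.

```r
## Hypothetical reference table; in practice this could be read from the prepared CSV
lookup <- data.frame(old = c("USA", "UGA", "CHL"),
                     new = c("United States", "Uganda", "Chile"),
                     stringsAsFactors = FALSE)

## Example data containing one value ("France") that is absent from the table
database1 <- data.frame(country = c("USA", "France", "CHL"),
                        stringsAsFactors = FALSE)

## factor() maps any value missing from `levels` to NA
as_factor <- as.character(factor(database1$country,
                                 levels = lookup$old, labels = lookup$new))

## match() gives the lookup row for each value, or NA when there is none
idx <- match(database1$country, lookup$old)

## Keep the original value wherever no replacement was found
database1$country <- ifelse(is.na(idx), database1$country, lookup$new[idx])
```

Here as_factor has NA in the position of "France", while database1$country keeps "France" unchanged. Which behaviour is preferable depends on whether unmatched names should be flagged or passed through.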
I have solved similar problems in the past by doing a bulk replacement driven by a .csv file.
Example of the .csv file format:
library(data.table)
## Generate example replacements csv file to see the format used
Replacements <- data.table(Old = c("USA", "UGA", "CHL"),
                           New = c("United States", "Uganda", "Chile"))
fwrite(Replacements,"Replacements.csv")
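Once you have "Replacements.csv", you can use it to replace all of the names in one pass with stringi::stri_replace_all_regex(). (For what it's worth, almost the entire stringr package is essentially a wrapper around stringi calls. Since stringi runs slightly faster and has a larger feature set, I prefer to stick with stringi.) See the stringi vs stringr blog post by HRBRMSTR.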
library(data.table)
library(readxl)
library(stringi)
## Read in list of replacements
Replacements <- fread("Replacements.csv")
## Read in file to be cleaned
database1<- read_excel("database.xlsx")
## Perform replacements; ^ and $ anchor each Old value so only whole-string matches are replaced
database1$country <- stringi::stri_replace_all_regex(database1$country,
                                                     "^" %s+% Replacements$Old %s+% "$",
                                                     Replacements$New,
                                                     vectorize_all = FALSE)
## Write CSV
write.csv(database1, "test.csv", row.names = FALSE, fileEncoding = 'UTF-8', na = "")
I tried to use base R data.frame syntax above where possible to avoid any confusion, but if I were doing this for myself I would stick with full data.table syntax, like this:
library(data.table)
library(readxl)
library(stringi)
## Read in list of replacements
Replacements <- fread("Replacements.csv")
## Read in file to be cleaned
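database1 <- as.data.table(read_excel("database.xlsx")) ## read_excel returns a tibble, so convert it before using :=
## Perform replacements; ^ and $ anchor each Old value so only whole-string matches are replaced
database1[, country := stri_replace_all_regex(country, "^" %s+% Replacements[, Old] %s+% "$",
                                              Replacements[, New],
                                              vectorize_all = FALSE)]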
## Write CSV
fwrite(database1,"test.csv")
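The ^ and $ anchors in the pattern matter because, with vectorize_all = FALSE, each pattern is applied in turn to every string. A minimal sketch of the difference, assuming the stringi package is installed; the input values are made up for illustration and are not taken from the original data:

```r
library(stringi)

## Made-up values for illustration; "UKRAINE" contains the abbreviation "UK"
x   <- c("USA", "UK", "UKRAINE")
old <- c("USA", "UK")
new <- c("United States", "United Kingdom")

## Without anchors, "UK" is also replaced inside "UKRAINE"
unanchored <- stri_replace_all_regex(x, old, new, vectorize_all = FALSE)

## With ^ and $, a pattern only matches when it is the entire string
anchored <- stri_replace_all_regex(x, "^" %s+% old %s+% "$", new,
                                   vectorize_all = FALSE)

print(unanchored)
print(anchored)
```

In the unanchored result the third element becomes "United KingdomRAINE", while the anchored version leaves "UKRAINE" alone. Because the Old values are treated as regular expressions, names containing characters such as parentheses or dots would probably need escaping before being used this way.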