How to recode multiple values in a vector into one value?
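I have a problem: my data.frame contains inconsistent values because it was assembled from different data sources. For example, the States column contains the same states written in different forms. Note that my actual data does not use US states.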
df <- data.frame(Names = c("Adam", "Mark", "Dahlia", "Jeff", "Derek",
                           "Arnold", "Sheppard", "Dwayne", "Nichols", "Shane"),
                 Age = c(27, 28, 29, 37, 26, 22, 29, 34, 31, 30),
                 States = c("AL", "Alaska", "Alabama", "WI",
                            "Wisconsin", "AZ", "Arizona", "AL", "WI", "AK"))
I am trying to recode values such as AL, WI, AZ and AK to Alabama, Wisconsin, Arizona and Alaska, respectively.
So far I have come up with:
library(dplyr)
case_when(
  df$States == "AL" ~ "Alabama",
  df$States == "AK" ~ "Alaska",
  df$States == "WI" ~ "Wisconsin",
  df$States == "AZ" ~ "Arizona",
)
It gives me the output:
[1] "Alabama" NA NA "Wisconsin" NA "Arizona" NA
[8] "Alabama" "Wisconsin" "Alaska"
I don't want the NA values, so what I did was:
case_when(
  df$States == "AL" ~ "Alabama",
  df$States == "Alabama" ~ "Alabama",
  df$States == "AK" ~ "Alaska",
  df$States == "Alaska" ~ "Alaska",
  df$States == "WI" ~ "Wisconsin",
  df$States == "Wisconsin" ~ "Wisconsin",
  df$States == "AZ" ~ "Arizona",
  df$States == "Arizona" ~ "Arizona",
)
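It gives me the output I want, but I think there must be a simpler way to do this.
I was thinking about a loop, since later I want to turn this into pseudocode. However, I have run out of ideas on how to do it. I'd really appreciate any help here.
Thanks.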
If you intend to match US state names, we can use the built-in vectors state.abb and state.name to match and replace.
inds <- match(df$States, state.abb)
df$States[which(!is.na(inds))] <- state.name[na.omit(inds)]
df
# Names Age States
#1 Adam 27 Alabama
#2 Mark 28 Alaska
#3 Dahlia 29 Alabama
#4 Jeff 37 Wisconsin
#5 Derek 26 Wisconsin
#6 Arnold 22 Arizona
#7 Sheppard 29 Arizona
#8 Dwayne 34 Alabama
#9 Nichols 31 Wisconsin
#10 Shane 30 Alaska
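Since the question notes that the real data does not use US states, the same match() idea can be applied to a lookup table of your own. The following is only a sketch: the lookup data frame and its column names (variant, canonical) are made up for illustration and would be replaced by whatever mapping you actually have.

```r
# Hypothetical lookup table: each variant spelling and its canonical form
lookup <- data.frame(
  variant   = c("AL", "AK", "WI", "AZ"),
  canonical = c("Alabama", "Alaska", "Wisconsin", "Arizona"),
  stringsAsFactors = FALSE
)

states <- c("AL", "Alaska", "Alabama", "WI", "Wisconsin",
            "AZ", "Arizona", "AL", "WI", "AK")

# Position of each value in the variant column; NA where there is no match
inds <- match(states, lookup$variant)

# Replace matched values only; unmatched values are kept as they were
recoded <- ifelse(is.na(inds), states, lookup$canonical[inds])
recoded
```

Values that are already in their canonical form have no entry in variant, so they pass through unchanged instead of becoming NA.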
You could also use %in% to shorten the case_when. It tests each value against several candidates at once, whereas == compares against only one.
library(dplyr)
df %>%
  mutate(States = case_when(States %in% c("AL", "Alabama") ~ "Alabama",
                            States %in% c("AK", "Alaska") ~ "Alaska",
                            States %in% c("WI", "Wisconsin") ~ "Wisconsin",
                            States %in% c("AZ", "Arizona") ~ "Arizona",
                            TRUE ~ NA_character_))
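To see why %in% is the right operator here and == is not, a small base R comparison may help. This is an illustrative sketch on a shortened vector x, not part of the answer's code.

```r
x <- c("AL", "Alaska", "Alabama", "WI")

# == compares element by element and recycles the right-hand side,
# so "Alabama" in position 3 is compared with "AL" and is missed
eq_result <- x == c("AL", "Alabama")

# %in% asks, for each element of x, whether it appears anywhere in the set
in_result <- x %in% c("AL", "Alabama")

eq_result
in_result
```

Here eq_result flags only the first element, while in_result flags both "AL" and "Alabama", which is the behaviour the case_when conditions rely on.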
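You can use dplyr's recode function with a named vector. I use setNames to create a named character vector (similar to key-value pairs), but you can build the vector from whatever data you have. Using your example, we can set up some keys and values: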
keys <- state.abb # the abbreviations you want to replace
vals <- state.name # the replacement values
keysvals <- setNames(vals, keys) # create named vector
Now call recode. Be sure to use !!! to unquote and splice the named vector:
library(dplyr)
df$States <- recode(df$States, !!!keysvals)
Which returns:
Names Age States
1 Adam 27 Alabama
2 Mark 28 Alaska
3 Dahlia 29 Alabama
4 Jeff 37 Wisconsin
5 Derek 26 Wisconsin
6 Arnold 22 Arizona
7 Sheppard 29 Arizona
8 Dwayne 34 Alabama
9 Nichols 31 Wisconsin
10 Shane 30 Alaska
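The question also mentions wanting a loop that could later be turned into pseudocode. The same named-vector idea can be written that way in base R without dplyr. This is a minimal sketch, assuming a hand-written mapping vector (the name mapping is chosen here for illustration):

```r
states <- c("AL", "Alaska", "Alabama", "WI", "Wisconsin",
            "AZ", "Arizona", "AL", "WI", "AK")

# Names are the values to replace, elements are the replacements
mapping <- c(AL = "Alabama", AK = "Alaska", WI = "Wisconsin", AZ = "Arizona")

# For each key, overwrite every element equal to that key
for (key in names(mapping)) {
  states[states == key] <- mapping[[key]]
}
states
```

Elements that match none of the keys are never touched, so full names already present stay as they are.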