Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R
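Goal: fill a column of NAs in one data frame using a "key" held in another data frame (similar to VLOOKUP, but in R).
Here is df1 (for simplicity I show only 6 rows; my real key has 50 rows, one for each of the 50 states):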
Index | State_Name | Abbreviation |
---|---|---|
1 | California | CA |
2 | Maryland | MD |
3 | New York | NY |
4 | Texas | TX |
5 | Virginia | VA |
6 | Washington | WA |
And here is df2 (this is only a sample; the real data frame I am working with has many more rows):
Index | State | Article |
---|---|---|
1 | NA | Texas governor, Abbott, signs new abortion bill |
2 | NA | Effort to recall California governor Newsome loses steam |
3 | NA | New York governor, Cuomo, accused of manipulating Covid-19 nursing home data |
4 | NA | Hogan (Maryland, R) announces plans to lift statewide Covid restrictions |
5 | NA | DC statehood unlikely as Manchin opposes |
6 | NA | Amazon HQ2 causing housing prices to soar in northern Virginia |
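Task: write an R function that loops over df2$Article, reads the state mentioned in each row, and cross-references it with df1$State_Name so that the NA in df2$State is replaced by the corresponding df1$Abbreviation. I know this is a mouthful. I am stuck on how to start, let alone finish, this puzzle. Hard-coding is not an option: my real data table has thousands of rows like these and will keep growing as we add more articles to the text scrape.
The output should look like this: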
Index | State | Article |
---|---|---|
1 | TX | Texas governor, Abbott, signs new abortion bill |
2 | CA | Effort to recall California governor Newsome loses steam |
3 | NY | New York governor, Cuomo, accused of manipulating Covid-19 nursing home data |
4 | MD | Hogan (Maryland, R) announces plans to lift statewide Covid restrictions |
5 | NA | DC statehood unlikely as Manchin opposes |
6 | VA | Amazon HQ2 causing housing prices to soar in northern Virginia |
Note: the 5th entry, the one about DC, stays NA.
Any links to guides and/or any advice on how to code this would be greatly appreciated. Thanks!
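You can build a regular-expression pattern from State_Name and use str_extract to pull the matching state name out of Article. Then use match to look up the corresponding Abbreviation in df1.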
library(stringr)
df2$State <- df1$Abbreviation[match(
  str_extract(df2$Article, str_c(df1$State_Name, collapse = '|')),
  df1$State_Name
)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA "VA"
You can also use the built-in state.name and state.abb vectors instead of df1 to get the state names and abbreviations.
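As a rough sketch of that suggestion, the lookup below uses state.name and state.abb in place of df1. It is written in base R (regexpr and regmatches rather than str_extract) so that it runs without extra packages. The word boundaries and all sample headlines except the first are assumptions added for illustration.

```r
# Sample headlines; only the first two come from the question, the rest are made up.
articles <- c(
  "Texas governor, Abbott, signs new abortion bill",
  "DC statehood unlikely as Manchin opposes",
  "Coal output falls in West Virginia",
  "Arkansas lawmakers pass budget"
)

# One alternation over the 50 built-in state names. The word boundaries (\\b)
# stop a name from matching inside a longer word.
pattern <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")

# regexpr() gives the position of the first match in each string, or -1.
m <- regexpr(pattern, articles, perl = TRUE)

# regmatches() drops the strings without a match, so start from a vector of NAs
# and fill in only the positions that matched.
found <- rep(NA_character_, length(articles))
found[m > 0] <- regmatches(articles, m)

# Translate the state names into two-letter abbreviations.
abbr <- state.abb[match(found, state.name)]
print(abbr)
#[1] "TX" NA   "WV" "AR"
```

One caveat with the full list of 50 names: a headline about "Washington, DC" would be tagged WA, so real data may need an extra rule for that case.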
Here is a way to do this in a for loop:
for(i in seq(nrow(df1))) {
  inds <- grep(df1$State_Name[i], df2$Article)
  if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2
# Index State Article
#1 1 TX Texas governor, Abbott, signs new abortion bill
#2 2 CA Effort to recall California governor Newsome loses steam
#3 3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4 4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5 5 <NA> DC statehood unlikely as Manchin opposes
#6 6 VA Amazon HQ2 causing housing prices to soar in northern Virginia
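Since the question asks for a function, the loop above could be wrapped as in the sketch below. The name fill_state, its arguments, and the small sample tables are hypothetical. Unlike the loop above, this version only writes to rows where State is still NA, which may or may not be what you want when the table is updated incrementally.

```r
# Hypothetical helper: fill State from key_df, touching only rows that are still NA.
fill_state <- function(articles_df, key_df) {
  for (i in seq_len(nrow(key_df))) {
    # fixed = TRUE treats the state name as a literal string, not a regex.
    hit <- grepl(key_df$State_Name[i], articles_df$Article, fixed = TRUE)
    todo <- hit & is.na(articles_df$State)
    articles_df$State[todo] <- key_df$Abbreviation[i]
  }
  articles_df
}

# Small made-up key and article tables with the same column names as df1 and df2.
key <- data.frame(
  State_Name   = c("Texas", "Virginia"),
  Abbreviation = c("TX", "VA"),
  stringsAsFactors = FALSE
)
news <- data.frame(
  State   = c(NA, "MD", NA),
  Article = c(
    "Texas governor, Abbott, signs new abortion bill",
    "Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
    "DC statehood unlikely as Manchin opposes"
  ),
  stringsAsFactors = FALSE
)

result <- fill_state(news, key)
print(result$State)
#[1] "TX" "MD" NA
```

The existing "MD" value is left alone, the Texas row is filled in, and the DC row stays NA because no state name in the key matches it.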
Not as concise as the above, but here is a Base R approach:
# Unlist while handling zero-length vectors: list_2_vec => function()
list_2_vec <- function(lst){
  # Coerce zero-length vectors to NA values of the appropriate type:
  # .zero_to_nas => function()
  .zero_to_nas <- function(x){
    if(identical(x, character(0))){
      NA_character_
    }else if(identical(x, integer(0))){
      NA_integer_
    }else if(identical(x, numeric(0))){
      NA_real_
    }else if(identical(x, complex(0))){
      NA_complex_
    }else if(identical(x, logical(0))){
      NA
    }else{
      # Keep only the first match so that every list element yields exactly
      # one value, even when an article mentions several states
      x[1]
    }
  }
  # Unlist cleaned list: res => vector
  res <- unlist(lapply(lst, .zero_to_nas))
  # Explicitly define the returned object: res => vector
  return(res)
}
# Classify each article as belonging to the appropriate state:
# clean_df => data.frame
clean_df <- transform(
  df2,
  State = df1$Abbreviation[
    match(
      list_2_vec(
        regmatches(
          Article,
          gregexpr(
            paste0(df1$State_Name, collapse = "|"), Article
          )
        )
      ),
      df1$State_Name
    )
  ]
)
# Data:
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland",
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA",
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA),
Article = c("Texas governor, Abbott, signs new abortion bill",
"Effort to recall California governor Newsome loses steam",
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data",
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))
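The Base R answer relies on gregexpr, which returns every match in an article rather than just the first. If an article can name more than one state and you would rather keep all of them, one possible variation is sketched below. The headlines, the shortened key table, and the ";" separator are assumptions made for illustration.

```r
# Shortened key table with the same columns as df1.
key <- data.frame(
  State_Name   = c("California", "Texas", "Virginia"),
  Abbreviation = c("CA", "TX", "VA"),
  stringsAsFactors = FALSE
)

# The first headline is made up; the second is from the question.
articles <- c(
  "Texas and California governors trade barbs",
  "DC statehood unlikely as Manchin opposes"
)

pattern <- paste0(key$State_Name, collapse = "|")

# gregexpr() finds every match, so each list element holds 0, 1, or more names.
hits <- regmatches(articles, gregexpr(pattern, articles))

# Collapse the matches of each article into one string; no match becomes NA.
states <- vapply(hits, function(x) {
  if (length(x) == 0) return(NA_character_)
  paste(key$Abbreviation[match(unique(x), key$State_Name)], collapse = ";")
}, character(1))
print(states)
#[1] "TX;CA" NA
```

The abbreviations appear in the order the states are mentioned in the text, and vapply guarantees one value per article, so the result can be assigned straight to a column.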