How to find and replace values in a data frame according to a list of priority words (with a for loop and a condition)?
I have a data frame with a column (the second one) in which each cell holds several words separated by ";".
my_dataframe <- data.frame( first_column = c("x", "y", "x", "x", "y"),
second_column = c("important; very important; not important",
"not important; important; very important",
"very important; important",
"important; not important",
"not important"))
> my_dataframe
first_column second_column
1 x important; very important; not important
2 y not important; important; very important
3 x very important; important
4 x important; not important
5 y not important
I want to keep only one word in each cell: the most important one.
So I listed the words in order of priority:
reference_importance <- list("very important", "important", "not important")
What I would like to get in the second column:
second_column
1 very important
2 very important
3 very important
4 important
5 not important
I tried this:
for (i in 1:dim(my_dataframe)[1]) {
for (j in 1:length(reference_importance)) {
if (j %in% my_dataframe$second_column){
my_dataframe$second_column[i] <- paste(j)
break}
}
}
Then I thought the problem was that it did not take into account the different words separated by ";", so I tried this:
for (i in 1:dim(my_dataframe)[1]) {
value_as_list <- strsplit(my_dataframe$second_column[i], ";")
print(value_as_list)
for (j in reference_importance) {
if (j %in% value_as_list){
my_dataframe$second_column[i] == j
break}
}
}
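But neither of these changed anything in my column...
(I made this example to simplify, but in reality I have a huge table with more words and possibilities. That is why I am trying to do it with a loop rather than assigning the possible answers by hand.)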
Basically, use strsplit and match.
my_dataframe <- transform(my_dataframe, z=strsplit(second_column, '; ') |>
lapply(match, reference_importance) |>
sapply(min) |>
{\(x) unlist(reference_importance)[x]}())
my_dataframe
# first_column second_column z
# 1 x important; very important; not important very important
# 2 y not important; important; very important very important
# 3 x very important; important very important
# 4 x important; not important important
# 5 y not important not important
Note: this requires R >= 4.1 (for the native pipe |> and the \(x) lambda syntax).
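For readers on an older R, the same strsplit and match idea can be written without the native pipe. This is only a sketch under a few assumptions: reference_importance is stored as a character vector rather than a list, every word in a cell appears in that vector, and stringsAsFactors = FALSE is set explicitly because data.frame() created factors by default before R 4.0.

```r
my_dataframe <- data.frame(
  first_column = c("x", "y", "x", "x", "y"),
  second_column = c("important; very important; not important",
                    "not important; important; very important",
                    "very important; important",
                    "important; not important",
                    "not important"),
  stringsAsFactors = FALSE)

# Assumption: priorities kept as a character vector, highest priority first
reference_importance <- c("very important", "important", "not important")

# Split every cell into its individual words
spl <- strsplit(my_dataframe$second_column, "; ", fixed = TRUE)

# For each row, find the smallest position in the priority vector
# (a smaller position means a higher priority)
best <- vapply(spl, function(w) min(match(w, reference_importance)), integer(1))

# Look the winning word up by position
my_dataframe$z <- reference_importance[best]
```

vapply is used instead of sapply so that an unexpected result type fails loudly rather than silently changing the column type.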
If you need a loop, you can do this:
spl <- strsplit(my_dataframe$second_column, '; ')
my_dataframe$z <- NA_character_
for (i in seq_along(spl)) {
my_dataframe$z[i] <- reference_importance[[min(match(spl[[i]], reference_importance))]]
}
my_dataframe
# first_column second_column z
# 1 x important; very important; not important very important
# 2 y not important; important; very important very important
# 3 x very important; important very important
# 4 x important; not important important
# 5 y not important not important
Of course, I use z for demonstration purposes; in practice you would assign to second_column instead of z.
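Since the real table is said to contain more words and possibilities, a cell may hold a word that is missing from the priority list, in which case min(match(...)) yields NA for the whole cell. The sketch below shows one possible way to tolerate that. pick_priority is a hypothetical helper name, not part of any package, and the sample cells are made up for illustration.

```r
# Hypothetical helper: return the highest-priority word found in one cell,
# or NA if none of the words is in the priority vector.
pick_priority <- function(cell, priority, sep = ";") {
  # Split on the separator and strip surrounding whitespace from each word
  words <- trimws(strsplit(cell, sep, fixed = TRUE)[[1]])
  # Subsetting `priority` keeps the hits in priority order
  hits <- priority[priority %in% words]
  if (length(hits) == 0) NA_character_ else hits[1]
}

reference_importance <- c("very important", "important", "not important")

# Made-up cells, including words that are not in the priority vector
cells <- c("important; very important", "unknown; not important", "unknown")

result <- vapply(cells, pick_priority, character(1),
                 priority = reference_importance, USE.NAMES = FALSE)
result
```

Applied to the data frame, this would be something like my_dataframe$second_column <- vapply(my_dataframe$second_column, pick_priority, character(1), priority = reference_importance, USE.NAMES = FALSE).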
If you want to use a loop, the following approach worked for me:
my_dataframe <- data.frame( first_column = c("x", "y", "x", "x", "y"),
second_column = c("important; very important; not important",
"not important; important; very important",
"very important; important",
"important; not important",
"not important"))
reference_importance <- list("very important", "important", "not important")
library(dplyr)  # provides %>% and mutate()
# add new column for priority word
my_dataframe <- my_dataframe %>%
  mutate(Priority_importance = NA_character_)
# use a loop to identify the highest-priority word in each row
for (i in seq_len(nrow(my_dataframe))) {
  # strsplit() returns a list, so take its first element; the pattern also drops the space after ";"
  words <- strsplit(my_dataframe$second_column[i], ";\\s*")[[1]]
  for (j in seq_along(reference_importance)) {
    if (reference_importance[[j]] %in% words) {
      my_dataframe$Priority_importance[i] <- reference_importance[[j]]  # store importance level
      break  # stop at the first (highest-priority) match
    }
  }
}
my_dataframe
first_column second_column Priority_importance
1 x important; very important; not important very important
2 y not important; important; very important very important
3 x very important; important very important
4 x important; not important important
5 y not important not important
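As a side note, here is a small sketch of the two pitfalls that keep the loops in the question from changing the column: strsplit returns a list (and splitting on ";" alone keeps the following space), and == compares values without assigning anything. The variable names are illustrative only.

```r
cell <- "important; very important"

# Pitfall 1: strsplit() returns a list (one element per input string),
# and splitting on ";" alone keeps the space that followed it.
parts_list <- strsplit(cell, ";")
is.list(parts_list)                 # TRUE
"very important" %in% parts_list    # FALSE

# Taking the first list element and splitting on "; " gives clean words.
parts <- strsplit(cell, "; ", fixed = TRUE)[[1]]
"very important" %in% parts         # TRUE

# Pitfall 2: `==` only compares; `<-` is needed to store a value.
x <- "old"
x == "new"   # evaluates to FALSE, x is unchanged
x <- "new"   # x now holds "new"
```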
An option with dplyr and tidyr:
library(dplyr)
library(tidyr)
library(tibble)  # rowid_to_column()
my_dataframe %>%
rowid_to_column() %>%
separate_rows(second_column, sep = "; ") %>%
group_by(rowid) %>%
slice_min(match(second_column, reference_importance))
rowid first_column second_column
<int> <chr> <chr>
1 1 x very important
2 2 y very important
3 3 x very important
4 4 x important
5 5 y not important
I used reference_importance as a character vector instead of a list:
reference_importance <- c("very important", "important", "not important")
Another possible solution, based on tidyverse:
library(tidyverse)
my_dataframe %>%
mutate(id = row_number()) %>%
separate_rows(second_column, sep = "\\s*;\\s*") %>%
group_by(id) %>%
slice(match(reference_importance, second_column) %>% na.omit() %>% .[1]) %>%
ungroup %>%
select(-id)
#> # A tibble: 5 × 2
#> first_column second_column
#> <chr> <chr>
#> 1 x very important
#> 2 y very important
#> 3 x very important
#> 4 x important
#> 5 y not important