R - map vector of unique values to dataframe column with duplicates

Question

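I have a column in a data frame that is a character vector. I want to add a column to my data frame containing unique ID values/codes, one for each unique value in that column. Here is some toy data: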

fnames <- c("joey", "joey", "joey", "jimmy", "jimmy", "tommy", "michael", "michael", "michael", "michael", "michael", "kevin", "kevin", "christopher", "aaron", "joshua", "joshua", "joshua", "arvid", "aiden", "kentavious", "lawrence", "xavier")

names <- as.data.frame(fnames)

To get the number of unique values of fnames, I run:

unique_fnames <- length(unique(names$fnames))

To generate a unique ID for each unique name, I found the following function:

create_unique_ids <- function(n, seed_no = 16169, char_len = 6){
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  
  res <- character(n)
  for(i in seq(n)){
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while(this_res %in% res){
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}

Applying create_unique_ids to unique_fnames gives me the required number of ID codes:

unique_fname_id <- create_unique_ids(unique_fnames)

My question is:

How can I add the vector unique_fname_id to my data frame names? The desired result is the data frame names with a unique_fname_id column that looks like this:

unique_fname_id <- c("VvWMKt", "VvWMKt", "VvWMKt", "yEbpFq", "yEbpFq", "Z3xCdO"...)

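where "VvWMKt" corresponds to "joey", "yEbpFq" corresponds to "jimmy", and so on. The data frame names would have the same number of rows as the original, just with this column added.

Is there a way to do this? All suggestions are welcome and appreciated. Thanks!

Edit: I need to keep set.seed in the create_unique_ids function so that the generated IDs can be reproduced from run to run.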

Answer 1

If you want to use your function and keep the seed, you can do this:

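library(dplyr)

# One row per distinct name, one ID per row, then join back onto the full data
names %>% 
  distinct(fnames) %>% 
  bind_cols(unique_ID = create_unique_ids(n_distinct(names$fnames))) %>% 
  left_join(names, by = "fnames")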
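The same mapping can also be sketched in base R with match(), without a join. This is only an illustration: the data is a shortened copy of the toy data, and the IDs are placeholder strings (an assumption for the example) standing in for the output of create_unique_ids(length(uniq)).

```r
# Toy data (shortened from the question)
names <- data.frame(
  fnames = c("joey", "joey", "joey", "jimmy", "jimmy", "tommy"),
  stringsAsFactors = FALSE
)

# Distinct names, in order of first appearance
uniq <- unique(names$fnames)

# Placeholder IDs (hypothetical); swap in create_unique_ids(length(uniq))
ids <- sprintf("ID%02d", seq_along(uniq))

# match() gives, for each row, the position of its name in `uniq`;
# indexing `ids` by that position repeats each ID for duplicated names
names$unique_fname_id <- ids[match(names$fnames, uniq)]
names
```

Because unique() and distinct() both keep values in order of first appearance, the i-th ID lands on the i-th distinct name in either approach.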

You can also remove the seed from your function (the set.seed(seed_no) line and the seed_no argument) and get a simpler solution:

names %>% 
  group_by(fnames) %>% 
  mutate(unique_ID = create_unique_ids(1))

   fnames  unique_ID
   <chr>   <chr>    
 1 joey    ea10KC   
 2 joey    ea10KC   
 3 joey    ea10KC   
 4 jimmy   MD5W4d   
 5 jimmy   MD5W4d   
 6 tommy   xR7ozW   
 7 michael uuGn3h   
 8 michael uuGn3h   
 9 michael uuGn3h   
10 michael uuGn3h   
# ... with 13 more rows

You can also use an existing function such as stringi::stri_rand_strings, which creates random alphanumeric strings with a fixed number of characters:

library(stringi); library(dplyr)

names %>% 
  group_by(fnames) %>% 
  mutate(unique_ID = stri_rand_strings(1, 6))
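One thing to keep in mind with the two group-wise variants: the generator is called once per group, so nothing checks that two groups did not receive the same ID. A rough birthday-problem estimate, assuming each ID is drawn independently and uniformly from the 62-character pool, suggests the risk is tiny for the toy data (on the order of 1e-9), though it grows with the number of groups:

```r
# Number of possible 6-character IDs over [A-Za-z0-9]
pool_size <- 26 + 26 + 10
char_len <- 6
n_possible <- pool_size^char_len

# Number of groups (distinct names in the toy data)
k <- 13

# Probability that at least two of k independent uniform draws coincide
p_collision <- 1 - prod(1 - (0:(k - 1)) / n_possible)
p_collision
```

If guaranteed uniqueness or reproducibility matters, generating all IDs in a single seeded call and joining them back, as in the first solution, avoids the issue.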

Answer 2

A crude way is to left join the IDs back onto the data:

library(tidyverse)

fnames <- c("joey", "joey", "joey", "jimmy", "jimmy", "tommy", "michael", "michael", "michael", "michael", "michael", "kevin", "kevin", "christopher", "aaron", "joshua", "joshua", "joshua", "arvid", "aiden", "kentavious", "lawrence", "xavier")

names <- as.data.frame(fnames)


unique_names <- names |> distinct()

unique_fnames <- length(unique(names$fnames))

create_unique_ids <- function(n, seed_no = 16169, char_len = 6){
  set.seed(seed_no)
  pool <- c(letters, LETTERS, 0:9)
  
  res <- character(n)
  for(i in seq(n)){
    this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    while(this_res %in% res){
      this_res <- paste0(sample(pool, char_len, replace = TRUE), collapse = "")
    }
    res[i] <- this_res
  }
  res
}

unique_fname_id <- create_unique_ids(unique_fnames)


df_ids <- tibble(fnames = unique_names |> pull(fnames), unique_fname_id = unique_fname_id)


names |> 
  left_join(df_ids)
#> Joining, by = "fnames"
#>         fnames unique_fname_id
#> 1         joey          VvWMKt
#> 2         joey          VvWMKt
#> 3         joey          VvWMKt
#> 4        jimmy          yEbpFq
#> 5        jimmy          yEbpFq
#> 6        tommy          Z3xCdO
#> 7      michael          ef8YkZ
#> 8      michael          ef8YkZ
#> 9      michael          ef8YkZ
#> 10     michael          ef8YkZ
#> 11     michael          ef8YkZ
#> 12       kevin          kDBFAq
#> 13       kevin          kDBFAq
#> 14 christopher          xR77mJ
#> 15       aaron          gaaI1C
#> 16      joshua          KM4dD9
#> 17      joshua          KM4dD9
#> 18      joshua          KM4dD9
#> 19       arvid          oTLl7g
#> 20       aiden          b63PnV
#> 21  kentavious          csnWuE
#> 22    lawrence          Ihi5VM
#> 23      xavier          HfM0mX

Created on 2021-12-03 by the reprex package (v2.0.1)
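After a join like this, it can be worth checking that the mapping really is one-to-one and that no row was left without an ID. A minimal base R sketch, using a small stand-in data frame (the object name mapped is hypothetical; the values are copied from the first rows of the output above):

```r
# Stand-in for the joined result
mapped <- data.frame(
  fnames = c("joey", "joey", "jimmy", "tommy"),
  unique_fname_id = c("VvWMKt", "VvWMKt", "yEbpFq", "Z3xCdO"),
  stringsAsFactors = FALSE
)

# Distinct (name, ID) pairs that occur in the data
pairs <- unique(mapped[, c("fnames", "unique_fname_id")])

# One-to-one: no name appears with two IDs, and no ID with two names
one_to_one <- anyDuplicated(pairs$fnames) == 0 &&
  anyDuplicated(pairs$unique_fname_id) == 0

# No row was left without an ID by the join
no_missing <- !anyNA(mapped$unique_fname_id)

stopifnot(one_to_one, no_missing)
```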
