One-hot-encoding multi-byte string values in R

I collected some data from a survey that asked respondents to rank their preferences among the following player profiles:

profile1: Tom, center, pitcher
profile2: Pete, right, hitter
profile3: Clay, left, hitter
profile4: Tom, right, fielder
profile5: Pete, left, fielder
profile6: Clay, center, pitcher

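However, because I was unfamiliar with the questionnaire software, the responses I collected were stored as multi-byte string values like the following (one newline-separated string per respondent) and then read into R: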

preferences <- data.frame(pref = c("1. Pete, right, hitter\n2. Clay, center, pitcher\n3. Tom, right, fielder\n4. Tom, center, pitcher\n5. Clay, left, hitter\n6. Pete, left, fielder",
"1. Tom, right, fielder\n2. Clay, center, pitcher\n3. Pete, left, fielder\n4. Pete, right, hitter\n5. Tom, center, pitcher\n6. Clay, left, hitter",
"1. Clay, left, hitter\n2. Tom, center, pitcher\n3. Pete, right, hitter\n4. Pete, left, fielder\n5. Clay, center, pitcher\n6. Tom, right, fielder"))

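I would like to know whether there is a way to map each respondent's ranked choices to separate columns corresponding to the player profiles given above, somewhat like one-hot encoding (OHE), and then convert the result into the following format: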

df <- data.frame(profile1 = c(4, 5, 2), profile2 = c(1, 4, 3), profile3 = c(5, 6, 1), profile4 = c(3, 1, 6), profile5 = c(6, 3, 4), profile6 = c(2, 2, 5))

df

  profile1 profile2 profile3 profile4 profile5 profile6
1        4        1        5        3        6        2
2        5        4        6        1        3        2
3        2        3        1        6        4        5

Any suggestions would be greatly appreciated.

You can create a lookup table of the profiles (lookup), and then manipulate the preferences object like this:

# Packages used below (tstrsplit comes from data.table)
library(dplyr)
library(tidyr)
library(stringr)

# Split each response on newlines: one column per rank position (1 to 6)
df <- setNames(as.data.frame(data.table::tstrsplit(preferences$pref, "\n")), paste0("profile", 1:6))

# Pivot longer, strip the "N. " rank prefix, join with lookup, then pivot back to wide
df %>% mutate(id = row_number()) %>%
  pivot_longer(starts_with("profile"), names_prefix = "profile") %>%
  mutate(value = str_remove(value, "^\\d+[.] ")) %>%
  inner_join(lookup, by = c("value" = "text")) %>%
  pivot_wider(id_cols = id, names_from = profile, values_from = name, names_sort = TRUE, names_prefix = "profile") %>%
  select(-id)

Output:

  profile1 profile2 profile3 profile4 profile5 profile6
  <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
1 4        1        5        3        6        2       
2 5        4        6        1        3        2       
3 2        3        1        6        4        5      
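
The columns in this output are character (<chr>) because the ranks come from column names. If numeric ranks are needed, one possible follow-up is sketched below. It assumes the pipeline result has been stored in a data frame, here given the hypothetical name `res` and filled with only two of the six columns for brevity.

```r
# Hypothetical stand-in for the pipeline result: ranks stored as character
res <- data.frame(
  profile1 = c("4", "5", "2"),
  profile2 = c("1", "4", "3"),
  stringsAsFactors = FALSE
)

# Convert every column to integer while keeping the data frame structure
res[] <- lapply(res, as.integer)

str(res)
```

Assigning into `res[]` rather than `res` keeps the result a data frame instead of a plain list.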

Input (the lookup table):

lookup <- structure(list(profile = c("1", "2", "3", "4", "5", "6"), text = c("Tom, center, pitcher", 
"Pete, right, hitter", "Clay, left, hitter", "Tom, right, fielder", 
"Pete, left, fielder", "Clay, center, pitcher")), row.names = c(NA, 
-6L), class = "data.frame")

The lookup table looks like this:

  profile                  text
1       1  Tom, center, pitcher
2       2   Pete, right, hitter
3       3    Clay, left, hitter
4       4   Tom, right, fielder
5       5   Pete, left, fielder
6       6 Clay, center, pitcher
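
The lookup table does not have to be typed out by hand. As a rough sketch, it could be derived from the profile lines given in the question, assuming they are available as a character vector (the name `profile_lines` is made up for this example):

```r
# Profile definitions as listed in the question (hypothetical variable name)
profile_lines <- c(
  "profile1: Tom, center, pitcher",
  "profile2: Pete, right, hitter",
  "profile3: Clay, left, hitter",
  "profile4: Tom, right, fielder",
  "profile5: Pete, left, fielder",
  "profile6: Clay, center, pitcher"
)

lookup <- data.frame(
  # Keep only the digits between "profile" and the colon
  profile = sub("^profile([0-9]+):.*$", "\\1", profile_lines),
  # Drop the "profileN: " prefix to get the text that appears in the responses
  text = sub("^profile[0-9]+: *", "", profile_lines),
  stringsAsFactors = FALSE
)

lookup
```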

# Sample data, with the responses kept as character strings rather than factors
preferences <- data.frame(pref = c("1. Pete, right, hitter\n2. Clay, center, pitcher\n3. Tom, right, fielder\n4. Tom, center, pitcher\n5. Clay, left, hitter\n6. Pete, left, fielder",
"1. Tom, right, fielder\n2. Clay, center, pitcher\n3. Pete, left, fielder\n4. Pete, right, hitter\n5. Tom, center, pitcher\n6. Clay, left, hitter",
"1. Clay, left, hitter\n2. Tom, center, pitcher\n3. Pete, right, hitter\n4. Pete, left, fielder\n5. Clay, center, pitcher\n6. Tom, right, fielder"), stringsAsFactors = FALSE)

profiles <- c(
  "Tom, center, pitcher",
  "Pete, right, hitter",
  "Clay, left, hitter",
  "Tom, right, fielder",
  "Pete, left, fielder",
  "Clay, center, pitcher"
)


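library(stringr)  # for str_replace_all

# For each respondent, find the position (rank) of every profile in the response
df <- data.frame(do.call(rbind, lapply(preferences$pref, function(x) {
  match(
    profiles,
    str_replace_all(strsplit(x, "\n")[[1]], "^[0-9]+\\. ", "")
  )
})))

names(df) <- paste0("profile", seq_along(profiles))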

df

#   profile1 profile2 profile3 profile4 profile5 profile6
# 1        4        1        5        3        6        2
# 2        5        4        6        1        3        2
# 3        2        3        1        6        4        5
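
To see why `match` yields the ranks here, the following small sketch walks through the first respondent only. The vector `ranked` is that respondent's response with the "N. " prefixes already removed (the variable name is chosen for this illustration).

```r
profiles <- c(
  "Tom, center, pitcher",
  "Pete, right, hitter",
  "Clay, left, hitter",
  "Tom, right, fielder",
  "Pete, left, fielder",
  "Clay, center, pitcher"
)

# First respondent's choices in ranked order, prefixes stripped
ranked <- c(
  "Pete, right, hitter",
  "Clay, center, pitcher",
  "Tom, right, fielder",
  "Tom, center, pitcher",
  "Clay, left, hitter",
  "Pete, left, fielder"
)

# Position of each profile within the ranking, i.e. the rank of profile1..profile6
ranks <- match(profiles, ranked)
ranks
# 4 1 5 3 6 2

# Swapping the arguments answers the inverse question:
# which profile number sits at rank 1, 2, ..., 6
profile_at_rank <- match(ranked, profiles)
profile_at_rank
# 2 6 4 1 3 5
```

The argument order matters: `match(profiles, ranked)` gives one rank per profile, which is the layout asked for in the question.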

In base R you could do the following.

First, read in your profiles:

text <- "profile1: Tom, center, pitcher
profile2: Pete, right, hitter
profile3: Clay, left, hitter
profile4: Tom, right, fielder
profile5: Pete, left, fielder
profile6: Clay, center, pitcher"

a <- read.dcf(textConnection(text), all = TRUE)

Note that if your profiles are in a file, use a <- read.dcf('file.name', all = TRUE) instead.
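
For readers unfamiliar with `read.dcf`, here is a small sketch of what it returns for "tag: value" lines, using only the first two profiles. With `all = TRUE` the result is expected to be a one-row data frame whose column names are the tags.

```r
text <- "profile1: Tom, center, pitcher
profile2: Pete, right, hitter"

# Parse the "tag: value" lines, closing the connection afterwards
con <- textConnection(text)
a <- read.dcf(con, all = TRUE)
close(con)

names(a)    # the tags become column names
unlist(a)   # named character vector of the profile descriptions
```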

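# Strip the "N. " rank prefixes, then split each response into its six choices
b <- strsplit(gsub("\\d+\\. ", "", preferences$pref), "\n")
# For each respondent, match every profile to its position in the ranking
setNames(data.frame(t(mapply(match, list(a), b))), names(a))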

  profile1 profile2 profile3 profile4 profile5 profile6
1        4        1        5        3        6        2
2        5        4        6        1        3        2
3        2        3        1        6        4        5
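
Whichever approach is used, a typo or stray whitespace in a response makes `match` return NA for that profile. A minimal sanity check, assuming the result is stored in `df` as in the question, is to confirm that every row is a complete permutation of 1 to 6:

```r
# Expected result from the question
df <- data.frame(profile1 = c(4, 5, 2), profile2 = c(1, 4, 3), profile3 = c(5, 6, 1),
                 profile4 = c(3, 1, 6), profile5 = c(6, 3, 4), profile6 = c(2, 2, 5))

# TRUE for each respondent whose ranks contain no NA and cover 1..6 exactly
is_valid <- apply(df, 1, function(r) !anyNA(r) && setequal(r, seq_along(r)))
is_valid

# A row with an unmatched profile (NA) fails the check
bad_row <- c(4, 1, NA, 3, 6, 2)
bad_ok <- !anyNA(bad_row) && setequal(bad_row, seq_along(bad_row))
bad_ok
```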