R: how to extract n-gram based rows

I have a data frame df:

userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta 
3108   -8.00  Easy       Easy      Easy         Easy
3207    3.00  Hard       Easy      Match        Match
3350    5.78  Hard       Easy      Hard         Hard
3961    10.00 Easy       Easy      Hard         Hard


1. userID is a factor variable
2. Score is numeric
3. All the 'Task_' features are factor variables with possible values 'Hard', 'Easy', 'Match'

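I want to look at possible associations between the transitions (Task_Alpha -> Task_Beta -> Task_Charlie -> Task_Delta) and the score.

My hypothesis is that the 2-gram (bi-gram) sequence Hard Hard may be associated with a higher score, whereas the sequence Easy Easy would be associated with a lower score.

In this example I only consider 2-grams. In my actual code I would also like to try longer sequences. For reference, besides the repeated pairs (Easy Easy, Hard Hard, Match Match), the bi-grams we can have are: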

Easy Hard
Hard Easy
Easy Match
Match Easy
Hard Match
Match Hard
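As a small check of the list above, the possible bi-grams can be enumerated with base R. This is only a sketch: the variable names (levels_task, bigrams, mixed) are made up for illustration, and the three values are the ones stated in the question.

```r
# Possible values of each Task_ column (taken from the question)
levels_task <- c("Easy", "Hard", "Match")

# All ordered pairs (first, second)
bigrams <- expand.grid(first = levels_task, second = levels_task,
                       stringsAsFactors = FALSE)

# Total number of ordered bi-grams: 3 * 3
nrow(bigrams)

# Only the pairs whose two elements differ (the six pairs listed above)
mixed <- bigrams[bigrams$first != bigrams$second, ]
nrow(mixed)
```

For n-grams of length k the same idea gives 3^k possible sequences, so the table grows quickly as the sequences get longer.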

Question: As a first step, the overall output I need looks something like this:

Task   Task  Score 
Easy   Easy -8.00
Easy   Easy -8.00
Easy   Easy -8.00
Hard   Easy  3.00
Easy  Match  3.00
Match Match  3.00
Hard   Easy  5.78
Easy   Hard  5.78
Hard   Hard  5.78
Easy   Easy  10.00
Easy   Hard  10.00
Hard   Hard  10.00

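First, you need to convert all factors to character (otherwise, in the next step, R may use their integer codes instead of the factor values).

One option is dplyr: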

library(dplyr)

df <- df %>% mutate_if(is.factor, as.character)

Then you can do:

data.frame(Task1 = c(df[, 3], df[, 4], df[, 5]),
           Task2 = c(df[, 4], df[, 5], df[, 6]),
           Score = rep(df[, 2], 3)) %>%
  arrange(Score)

Output:

   Task1 Task2 Score
1   Easy  Easy -8.00
2   Easy  Easy -8.00
3   Easy  Easy -8.00
4   Hard  Easy  3.00
5   Easy Match  3.00
6  Match Match  3.00
7   Hard  Easy  5.78
8   Easy  Hard  5.78
9   Hard  Hard  5.78
10  Easy  Easy 10.00
11  Easy  Hard 10.00
12  Hard  Hard 10.00
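The same column-shifting idea can be generalized to any n. The following is a minimal base R sketch, not a definitive implementation: make_ngrams() is a hypothetical helper written for this example, and the data frame is rebuilt from the table in the question.

```r
# Example data reconstructed from the question
df <- data.frame(
  userID = factor(c(3108, 3207, 3350, 3961)),
  Score = c(-8, 3, 5.78, 10),
  Task_Alpha = c("Easy", "Hard", "Hard", "Easy"),
  Task_Beta = c("Easy", "Easy", "Easy", "Easy"),
  Task_Charlie = c("Easy", "Match", "Hard", "Hard"),
  Task_Delta = c("Easy", "Match", "Hard", "Hard"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per n-gram of consecutive task columns
make_ngrams <- function(data, task_cols, n) {
  stopifnot(n >= 1, n <= length(task_cols))
  # Starting positions of each sliding window of width n
  starts <- seq_len(length(task_cols) - n + 1)
  pieces <- lapply(starts, function(s) {
    window <- data[, task_cols[s:(s + n - 1)], drop = FALSE]
    # as.character() guards against factor columns
    window[] <- lapply(window, as.character)
    names(window) <- paste0("Task", seq_len(n))
    cbind(window, Score = data$Score)
  })
  out <- do.call(rbind, pieces)
  # order() keeps ties in their original order, so each user's
  # n-grams stay in left-to-right position order
  out <- out[order(out$Score), ]
  rownames(out) <- NULL
  out
}

task_cols <- grep("^Task_", names(df), value = TRUE)
bigrams <- make_ngrams(df, task_cols, n = 2)
trigrams <- make_ngrams(df, task_cols, n = 3)
```

Here bigrams has the same rows as the output above, and changing n is the only thing needed to get longer sequences.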

I was able to solve this as follows:

Step 1: As a first step, I concatenated the columns:

df$all = paste(df$Task_Alpha,
               df$Task_Beta,
               df$Task_Charlie,
               df$Task_Delta,
               sep = "-")

userID  Score Task_Alpha Task_Beta Task_Charlie Task_Delta all
3108   -8.00  Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy
3207    3.00  Hard       Easy      Match        Match      Hard-Easy-Match-Match
3350    5.78  Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard
3961    10.00 Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard

Step 2: As a second step (to get a more general solution), I tried an n-gram based approach, splitting the string into n-grams of whatever size I want:

library(tidytext)
library(dplyr)

df = as_tibble(df)
df_test = df %>%
   unnest_tokens(bigram, all, token = "ngrams", n = 2)

This gives me the output:

userID Score Task_*A* Task_*B* Task_*C* Task_*D* all                   bigram
3108  -8.00  Easy     Easy     Easy     Easy     Easy-Easy-Easy-Easy   easy easy
3108  -8.00  Easy     Easy     Easy     Easy     Easy-Easy-Easy-Easy   easy easy
3108  -8.00  Easy     Easy     Easy     Easy     Easy-Easy-Easy-Easy   easy easy
3207   3.00  Hard     Easy     Match    Match    Hard-Easy-Match-Match hard easy
3207   3.00  Hard     Easy     Match    Match    Hard-Easy-Match-Match easy match
3207   3.00  Hard     Easy     Match    Match    Hard-Easy-Match-Match match match
3350   5.78  Hard     Easy     Hard     Hard     Hard-Easy-Hard-Hard   hard easy
3350   5.78  Hard     Easy     Hard     Hard     Hard-Easy-Hard-Hard   easy hard
3350   5.78  Hard     Easy     Hard     Hard     Hard-Easy-Hard-Hard   hard hard
3961   10.00 Easy     Easy     Hard     Hard     Easy-Easy-Hard-Hard   easy easy
3961   10.00 Easy     Easy     Hard     Hard     Easy-Easy-Hard-Hard   easy hard
3961   10.00 Easy     Easy     Hard     Hard     Easy-Easy-Hard-Hard   hard hard

Step 3: This solution meets my requirements even when I want to increase the n-gram size. For example, for 3-grams I can simply do:

  df = as_tibble(df)
  df_test = df %>%
    unnest_tokens(trigram, all, token = "ngrams", n = 3)

which produces:

userID Score Task_*A* Task_*B* Task_*C* Task_*D* all                   trigram
3108  -8.00  Easy     Easy     Easy     Easy     Easy-Easy-Easy-Easy   easy easy easy
3108  -8.00  Easy     Easy     Easy     Easy     Easy-Easy-Easy-Easy   easy easy easy
3207   3.00  Hard     Easy     Match    Match    Hard-Easy-Match-Match hard easy match
3207   3.00  Hard     Easy     Match    Match    Hard-Easy-Match-Match easy match match
3350   5.78  Hard     Easy     Hard     Hard     Hard-Easy-Hard-Hard   hard easy hard
3350   5.78  Hard     Easy     Hard     Hard     Hard-Easy-Hard-Hard   easy hard hard
3961   10.00 Easy     Easy     Hard     Hard     Easy-Easy-Hard-Hard   easy easy hard
3961   10.00 Easy     Easy     Hard     Hard     Easy-Easy-Hard-Hard   easy hard hard
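Once the data is in this long format, a rough first look at the hypothesis from the question is to average Score per n-gram. The sketch below uses base R only; the bigram_scores data frame is typed in from the bi-gram output of Step 2, and the names bigram_scores and mean_by_bigram are made up for this example.

```r
# Long-format bi-grams, copied from the Step 2 output
bigram_scores <- data.frame(
  bigram = c("easy easy", "easy easy", "easy easy",
             "hard easy", "easy match", "match match",
             "hard easy", "easy hard", "hard hard",
             "easy easy", "easy hard", "hard hard"),
  Score = c(-8, -8, -8, 3, 3, 3, 5.78, 5.78, 5.78, 10, 10, 10),
  stringsAsFactors = FALSE
)

# Mean score per bi-gram, sorted from lowest to highest
mean_by_bigram <- aggregate(Score ~ bigram, data = bigram_scores, FUN = mean)
mean_by_bigram <- mean_by_bigram[order(mean_by_bigram$Score), ]
mean_by_bigram

# How often each bi-gram occurs
table(bigram_scores$bigram)
```

With only four users this is purely illustrative, and because each user contributes several rows the rows are not independent observations.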