R: how to extract rows based on n-grams
I have a data frame df:
userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta
3108 -8.00 Easy Easy Easy Easy
3207 3.00 Hard Easy Match Match
3350 5.78 Hard Easy Hard Hard
3961 10.00 Easy Easy Hard Hard
1. userID is a factor variable
2. Score is numeric
3. All the 'Task_' features are factor variables with possible values 'Hard', 'Easy', 'Match'
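For anyone who wants to reproduce the example, here is a minimal sketch that rebuilds df from the table above. The values are copied from the table; the level order in `tasks` is an assumption, since the question does not state it.

```r
# Assumed level order for the task factors (not given in the question).
tasks <- c("Easy", "Hard", "Match")

# Rebuild the example data frame with the column types described above.
df <- data.frame(
  userID       = factor(c(3108, 3207, 3350, 3961)),
  Score        = c(-8.00, 3.00, 5.78, 10.00),
  Task_Alpha   = factor(c("Easy", "Hard", "Hard", "Easy"),  levels = tasks),
  Task_Beta    = factor(c("Easy", "Easy", "Easy", "Easy"),  levels = tasks),
  Task_Charlie = factor(c("Easy", "Match", "Hard", "Hard"), levels = tasks),
  Task_Delta   = factor(c("Easy", "Match", "Hard", "Hard"), levels = tasks)
)

str(df)
```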
I want to look at possible associations between the transitions (Task_Alpha, Task_Beta, Task_Charlie, Task_Delta) and the score.
My hypothesis is that the 2-gram (bigram) sequence Hard Hard may be associated with higher scores, whereas the sequence Easy Easy would be associated with lower scores.
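In this example I only consider 2-grams; in my actual code I would also like to try longer sequences. For reference, besides the repeated pairs (Easy Easy, Hard Hard, Match Match), the bigrams we can have are: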
Easy Hard
Hard Easy
Easy Match
Match Easy
Hard Match
Match Hard
Question: as a first step, the overall output I need looks like this:
Task Task Score
Easy Easy -8.00
Easy Easy -8.00
Easy Easy -8.00
Hard Easy 3.00
Easy Match 3.00
Match Match 3.00
Hard Easy 5.78
Easy Hard 5.78
Hard Hard 5.78
Easy Easy 10.00
Easy Hard 10.00
Hard Hard 10.00
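First, you need to convert all factors to character. Otherwise, in the next step, R may use the factors' integer codes instead of their labels (this is what c() does with factors in R versions before 4.1.0).
One option with dplyr: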
library(dplyr)
df <- df %>% mutate_if(is.factor, as.character)
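To see why the conversion matters, here is a small illustration of the difference between a factor's internal codes and its labels. The vector `f` is a made-up example, not part of the original data.

```r
# A factor stores integer codes that index into its sorted levels.
f <- factor(c("Hard", "Easy", "Match"))

levels(f)        # "Easy" "Hard" "Match"
as.integer(f)    # 2 1 3  (the codes, not the labels)
as.character(f)  # "Hard" "Easy" "Match"
```

Any step that falls back to the codes would therefore yield numbers such as 2, 1, 3 rather than the task labels.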
Then you can do:
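# Pair each task column with the next one and stack the pairs.
# [[ ]] returns a plain vector for both data frames and tibbles.
data.frame(Task1 = c(df[[3]], df[[4]], df[[5]]),
           Task2 = c(df[[4]], df[[5]], df[[6]]),
           Score = rep(df[[2]], 3)) %>%
  arrange(Score)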
Output:
Task1 Task2 Score
1 Easy Easy -8.00
2 Easy Easy -8.00
3 Easy Easy -8.00
4 Hard Easy 3.00
5 Easy Match 3.00
6 Match Match 3.00
7 Hard Easy 5.78
8 Easy Hard 5.78
9 Hard Hard 5.78
10 Easy Easy 10.00
11 Easy Hard 10.00
12 Hard Hard 10.00
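The column-pairing idea above can be extended to longer sequences. Below is a minimal base R sketch, assuming the task columns are listed in time order. `shift_ngrams` and its arguments are hypothetical names introduced here for illustration, not functions from dplyr or any other package.

```r
# Example data, copied from the question; strings become factors as described there.
df <- data.frame(
  userID       = factor(c(3108, 3207, 3350, 3961)),
  Score        = c(-8.00, 3.00, 5.78, 10.00),
  Task_Alpha   = c("Easy", "Hard", "Hard", "Easy"),
  Task_Beta    = c("Easy", "Easy", "Easy", "Easy"),
  Task_Charlie = c("Easy", "Match", "Hard", "Hard"),
  Task_Delta   = c("Easy", "Match", "Hard", "Hard"),
  stringsAsFactors = TRUE
)
task_cols <- c("Task_Alpha", "Task_Beta", "Task_Charlie", "Task_Delta")

# Hypothetical helper: stack every window of n adjacent task columns.
shift_ngrams <- function(df, task_cols, n = 2, score_col = "Score") {
  stopifnot(n >= 1, n <= length(task_cols))
  starts <- seq_len(length(task_cols) - n + 1)
  pieces <- lapply(starts, function(s) {
    window <- df[task_cols[s:(s + n - 1)]]
    window[] <- lapply(window, as.character)   # labels, not factor codes
    names(window) <- paste0("Task", seq_len(n))
    window[[score_col]] <- df[[score_col]]
    window
  })
  out <- do.call(rbind, pieces)
  out <- out[order(out[[score_col]]), , drop = FALSE]  # ties keep their order
  rownames(out) <- NULL
  out
}

shift_ngrams(df, task_cols, n = 2)  # same 12 rows as the output above
shift_ngrams(df, task_cols, n = 3)  # 8 rows with Task1, Task2, Task3
```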
I was able to solve this as follows:
Step 1:
First, I concatenated the columns:
df$all = paste(df$Task_Alpha,
               df$Task_Beta,
               df$Task_Charlie,
               df$Task_Delta,
               sep = "-")
userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta all
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard
Step 2:
Next, to get a more general solution, I tried an n-gram-based approach: splitting the concatenated string into n-grams of whatever size I want.
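library(tidytext)
library(dplyr)
df = as_tibble(df)
# Split `all` into overlapping 2-grams, one row per bigram.
# drop = FALSE keeps the `all` column, as in the output below.
df_test = df %>%
  unnest_tokens(bigram, all, token = "ngrams", n = 2, drop = FALSE)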
This gives me the output:
userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta all bigram
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match hard easy
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match easy match
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match match match
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard hard easy
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard easy hard
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard hard hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard easy easy
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard easy hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard hard hard
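Note that the bigrams above come out in lower case, because unnest_tokens() lowercases its input by default (its to_lower argument defaults to TRUE). If the original capitalization matters, one alternative is a small base R splitter. This is only a sketch under the assumption that tokens are separated by "-"; `ngrams_from_string` is a hypothetical helper name, not a tidytext function.

```r
# Hypothetical helper: split one "-"-separated string into overlapping n-grams.
ngrams_from_string <- function(x, n = 2, sep = "-") {
  tokens <- strsplit(x, sep, fixed = TRUE)[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(s) paste(tokens[s:(s + n - 1)], collapse = " "),
         character(1))
}

# Concatenated strings and scores, copied from the Step 1 table.
all_strings <- c("Easy-Easy-Easy-Easy", "Hard-Easy-Match-Match",
                 "Hard-Easy-Hard-Hard", "Easy-Easy-Hard-Hard")
scores <- c(-8.00, 3.00, 5.78, 10.00)

# One row per bigram, with the score repeated for each bigram of a user.
grams <- lapply(all_strings, ngrams_from_string, n = 2)
result <- data.frame(Score  = rep(scores, lengths(grams)),
                     bigram = unlist(grams))
result
```

Changing n in the lapply() call gives longer sequences in the same way as the tidytext version.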
Step 3:
This solution meets my requirements even when I want a larger n. For example, for 3-grams I can simply do:
df = as_tibble(df)
# Same call with n = 3; drop = FALSE keeps the `all` column.
df_test = df %>%
  unnest_tokens(trigram, all, token = "ngrams", n = 3, drop = FALSE)
This produces:
userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta all trigram
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy easy
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy easy
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match hard easy match
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match easy match match
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard hard easy hard
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard easy hard hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard easy easy hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard easy hard hard
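Once the n-grams are in long form, a possible next step toward the original hypothesis is to average the score per n-gram. The sketch below uses base R's aggregate() on the bigram and Score values copied from the Step 2 output. With only four users it is purely illustrative, and `bigrams` and `summary_tbl` are names introduced here.

```r
# Bigram/Score pairs copied from the Step 2 output.
bigrams <- data.frame(
  bigram = c("easy easy", "easy easy", "easy easy",
             "hard easy", "easy match", "match match",
             "hard easy", "easy hard", "hard hard",
             "easy easy", "easy hard", "hard hard"),
  Score  = c(-8.00, -8.00, -8.00,
              3.00,  3.00,  3.00,
              5.78,  5.78,  5.78,
             10.00, 10.00, 10.00)
)

# Mean score per bigram, sorted from lowest to highest.
summary_tbl <- aggregate(Score ~ bigram, data = bigrams, FUN = mean)
summary_tbl[order(summary_tbl$Score), ]
```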