Using mutate to get the number of ngrams
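I'm using dplyr to parse a column containing sentences and count the number of ngrams in each sentence. The example below demonstrates the problem I'm running into.
One would expect ngram_cnt to be 3 and 4, but the resulting column contains 3, 3. The code returns the number of ngrams for the first sentence and ignores the rest. Adding more sentences has the same effect. What am I doing wrong?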
library(NLP)
library(dplyr)
library(stringr)
phrases <- c("this is the first", "and then comes the second")
df <- data.frame(phrase = phrases, id = c(1, 2))
df %>% mutate(ngram_cnt = length(ngrams(str_split(phrase, "\\s")[[1]], 2)))
If I instead run,
phrases <- c("this is the first", "and then comes the second",
"and the third which is even longer")
df <- data.frame(phrase = phrases, id = c(1, 2, 3))
df %>% mutate(ngram_cnt = str_length(phrase))
then I get the expected result (i.e. the length of each sentence).
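That's because in
df %>% mutate(ngram_cnt = length(ngrams(str_split(phrase, "\\s")[[1]], 2)))
the [[1]] selects only the split of the first sentence. mutate passes the whole phrase column to the expression at once, so this is the same as:
length(ngrams(str_split(phrases, "\\s")[[1]], 2))
# [1] 3
mutate then recycles that single 3 into every row.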
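To make the effect of [[1]] concrete, here is a small sketch that inspects what str_split returns for the two example phrases. The variable names words and first_only are only illustrative and do not appear in the question.

```r
library(stringr)

phrases <- c("this is the first", "and then comes the second")

# str_split() is vectorized: it returns a list with one character
# vector of tokens per input phrase.
words <- str_split(phrases, "\\s")

length(words)   # 2: one list element per phrase
lengths(words)  # 4 5: number of words in each phrase

# [[1]] extracts only the first list element, so anything computed
# from it describes the first phrase alone.
first_only <- words[[1]]
first_only      # "this" "is" "the" "first"
```

Because length() then returns a single number for that one element, mutate has nothing per-row to work with and repeats the value.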
phrases <- c("this is the first", "and then comes the second")
df <- data.frame(phrase = phrases, id = c(1, 2))
library("dplyr")
You can use rowwise to apply the calculation row by row:
df %>% rowwise() %>% mutate(ngram_cnt = length(ngrams(str_split(phrase, "\\s")[[1]], n = 2)))
# Source: local data frame [2 x 3]
# Groups: <by row>
#
# phrase id ngram_cnt
# (fctr) (dbl) (int)
# 1 this is the first 1 3
# 2 and then comes the second 2 4
Or use group_by, if your ids are unique:
df %>% group_by(id) %>% mutate(ngram_cnt = length(ngrams(str_split(phrase, "\\s")[[1]], n = 2)))
# Source: local data frame [2 x 3]
# Groups: id [2]
#
# phrase id ngram_cnt
# (fctr) (dbl) (int)
# 1 this is the first 1 3
# 2 and then comes the second 2 4
Or you can vectorize the function that computes the number of ngrams:
length_ngrams <- function(x) {
length(ngrams(str_split(x, "\\s")[[1]], n = 2))
}
length_ngrams <- Vectorize(length_ngrams)
df %>% mutate(ngram_cnt = length_ngrams(phrase))
# phrase id ngram_cnt
# 1 this is the first 1 3
# 2 and then comes the second 2 4
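As a further variation on the same idea, the sketch below keeps the full list returned by str_split and counts the ngrams of each element with vapply, so no [[1]] is needed. The helper name count_ngrams and its n argument are hypothetical and not part of the answer above; stringsAsFactors = FALSE is set only so that phrase is a character column regardless of R version.

```r
library(NLP)
library(dplyr)
library(stringr)

phrases <- c("this is the first", "and then comes the second")
df <- data.frame(phrase = phrases, id = c(1, 2), stringsAsFactors = FALSE)

# Hypothetical helper: returns one ngram count per input phrase.
count_ngrams <- function(x, n = 2L) {
  tokens <- str_split(x, "\\s")  # list with one token vector per phrase
  # Count the ngrams of every token vector; integer(1) enforces that
  # each phrase yields exactly one integer.
  vapply(tokens, function(w) length(ngrams(w, n)), integer(1))
}

result <- df %>% mutate(ngram_cnt = count_ngrams(phrase))
result$ngram_cnt  # 3 4
```

Since the helper already returns a vector as long as its input, it can be used in a plain mutate without rowwise, group_by, or Vectorize.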