Extract character-level n-grams from text in R
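I have a data frame with text, and for each text I want to extract the character-level bigrams (n = 2), e.g. "st", "ac", "ck", in R.
I also want to count how often each character-level bigram occurs in the text.
Data: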
df$text
[1] "hy my name is"
[2] "stackover flow is great"
[3] "how are you"
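I'm not quite sure what your expected output is here. I would have thought that the bigrams of "stack" would be "st", "ta", "ac" and "ck", since this captures every consecutive pair.
For example, if you wanted to know how many instances of the bigram "th" the word "brothers" contains, and you split it into the bigrams "br", "ot", "he" and "rs", you would get the answer 0, which is wrong.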
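To make the "brothers" point concrete, here is a minimal base R sketch that compares the two ways of splitting a word. The variable names are only illustrative; the one assumption is that "overlapping" means starting a bigram at every character position except the last.

```r
# Compare overlapping and non-overlapping bigrams for a single word.
word <- "brothers"
n <- nchar(word)

# Overlapping: start a bigram at every position except the last.
overlapping <- substring(word, 1:(n - 1), 2:n)

# Non-overlapping: start a bigram at every second position.
starts <- seq(1, n - 1, by = 2)
non_overlapping <- substring(word, starts, starts + 1)

overlapping
#> [1] "br" "ro" "ot" "th" "he" "er" "rs"
non_overlapping
#> [1] "br" "ot" "he" "rs"

sum(overlapping == "th")      # the bigram is found once
#> [1] 1
sum(non_overlapping == "th")  # the bigram is missed
#> [1] 0
```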
You can build a function to get all of the bigrams, like this:
# This function takes a vector of single characters and creates all the bigrams
# within that vector. For example "s", "t", "a", "c", "k" becomes
# "st", "ta", "ac", and "ck"
pair_chars <- function(char_vec) {
  all_pairs <- paste0(char_vec[-length(char_vec)], char_vec[-1])
  return(as.vector(all_pairs[nchar(all_pairs) == 2]))
}
# This function splits each word into a character vector and gets its bigrams
word_bigrams <- function(words) {
  unlist(lapply(strsplit(words, ""), pair_chars))
}
# This function splits a string or vector of strings into words and gets their bigrams
string_bigrams <- function(strings) {
  unlist(lapply(strsplit(strings, " "), word_bigrams))
}
So now we can test it on your example:
df <- data.frame(text = c("hy my name is", "stackover flow is great",
                          "how are you"), stringsAsFactors = FALSE)
string_bigrams(df$text)
#> [1] "hy" "my" "na" "am" "me" "is" "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "fl"
#> [16] "lo" "ow" "is" "gr" "re" "ea" "at" "ho" "ow" "ar" "re" "yo" "ou"
If you want to count the occurrences, you can use table:
table(string_bigrams(df$text))
#> ac am ar at ck ea er fl gr ho hy is ko lo me my na ou ov ow re st ta ve yo
#>  1  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1  1  1  1  2  2  1  1  1  1
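The table above pools all three texts. Since the question asks about each text, the following is a hedged sketch of one way to keep the counts separate. `bigrams_one_text` and `counts_per_text` are hypothetical names introduced here, not part of the answer's code, and the helper is written to be self-contained rather than to reuse the functions above.

```r
# Hypothetical helper: overlapping bigrams within each word of ONE string.
bigrams_one_text <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nchar(words) >= 2]   # words shorter than 2 characters have no bigram
  unlist(lapply(words, function(w) {
    n <- nchar(w)
    substring(w, 1:(n - 1), 2:n)
  }))
}

texts <- c("hy my name is", "stackover flow is great", "how are you")

# One frequency table per text, stored in a list in the original order.
counts_per_text <- lapply(texts, function(x) table(bigrams_one_text(x)))
names(counts_per_text) <- texts

# Look up the counts for a single text by name.
counts_per_text[["how are you"]]
```

A list of tables keeps each text's counts independent; if a single wide matrix is preferred, the per-text vectors would first need a shared set of bigram names.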
However, if you are going to do a lot of text mining, you should look at dedicated R packages such as stringi, stringr, tm and quanteda, which help with these basic tasks.
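For example, the base R functions I wrote above can be replaced using the quanteda package, as follows. Note that this version drops the spaces before pairing characters, so its output also contains bigrams that span word boundaries (such as "ym") and text boundaries (such as "ss"):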
library(quanteda)
char_ngrams(unlist(tokens(df$text, "character")), concatenator = "")
#> [1] "hy" "ym" "my" "yn" "na" "am" "me" "ei" "is" "ss" "st" "ta" "ac" "ck"
#> [15] "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow" "wi" "is" "sg" "gr" "re" "ea"
#> [29] "at" "th" "ho" "ow" "wa" "ar" "re" "ey" "yo" "ou"
Created on 2020-06-13 by the reprex package (v0.3.0)
In addition to Allan's answer, you can combine the qgrams function from the stringdist package with gsub to remove the spaces.
library(stringdist)
qgrams(gsub(" ", "", df$text), q = 2)
   hy ym yn yo my na st ta ve wi wa ov rf sg ow re ou me is ko lo am ei er fl gr ho ey ck ea at ar ac
V1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  1  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1
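For comparison, here is a rough base R sketch of the same space-stripping idea, generalised to any n. `char_ngrams_nospace` is a hypothetical name, not a function from stringdist or quanteda. It assumes spaces are removed within each text separately, so n-grams may span word boundaries but not text boundaries.

```r
# Hypothetical helper: all overlapping character n-grams of each string,
# taken after removing spaces, so n-grams may span word boundaries.
char_ngrams_nospace <- function(strings, n = 2) {
  cleaned <- gsub(" ", "", strings, fixed = TRUE)
  unlist(lapply(cleaned, function(s) {
    len <- nchar(s)
    if (len < n) return(character(0))   # too short to contain any n-gram
    starts <- seq_len(len - n + 1)
    substring(s, starts, starts + n - 1)
  }))
}

texts <- c("hy my name is", "stackover flow is great", "how are you")

# Bigram frequencies over all texts.
bigram_counts <- table(char_ngrams_nospace(texts, n = 2))

# Bigrams that occur more than once.
names(bigram_counts)[bigram_counts > 1]
#> [1] "is" "ow" "re"

# The same helper with n = 3 gives character trigrams.
char_ngrams_nospace("how are you", n = 3)
#> [1] "how" "owa" "war" "are" "rey" "eyo" "you"
```

Whether bigrams such as "ym" are wanted depends on the task; if they are not, split on spaces first and pair characters within each word.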