Table of n-grams and identifying the row in which the text appeared
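I would like to build a table in which the n-grams appear in one column, together with the row number of the data frame from which each was constructed.
For example, the following code constructs the n-grams (four-grams in this case):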
# Libraries
library(quanteda)
library(data.table)
library(tidyverse)
library(stringr)
# Dataframe
Data <- data.frame(Column1 = c(1.222, 3.445, 5.621, 8.501, 9.302),
                   Column2 = c(654231, 12347, -2365, 90000, 12897),
                   Column3 = c('A1', 'B2', 'E3', 'C1', 'F5'),
                   Column4 = c('I bought it', 'The flower has a beautiful fragrance', 'It was bought by me', 'I have bought it', 'The flower smells good'),
                   Column5 = c('Good', 'Bad', 'Ok', 'Moderate', 'Perfect'))
# Text column of interest
TextColumn <- Data$Column4
# Corpus
Content <- corpus(TextColumn)
# Tokenization
Tokens <- tokens(Content, what = "word",
                 remove_punct = TRUE,
                 remove_symbols = TRUE,
                 remove_numbers = FALSE,
                 remove_url = TRUE,
                 remove_separators = TRUE,
                 split_hyphens = FALSE,
                 include_docvars = TRUE,
                 padding = FALSE)
Tokens <- tokens_tolower(Tokens)
# n-grams
quadgrams <- dfm(tokens_ngrams(Tokens, n = 4))
quadgrams_freq <- textstat_frequency(quadgrams) # quadgram frequency
quadgrs <- subset(quadgrams_freq,select=c(feature,frequency))
names(quadgrs) <- c("ngram","freq")
quadgrs <- as.data.table(quadgrs)
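The result is a table with the columns ngram and freq.
Is there a way to also get the row number of Column4 from which each n-gram was extracted? For example, the row-number column should contain 2 for "the_flower_has_a" in the table above, 2 again for "flower_has_a_beautiful", and so on.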
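You can pass a groups argument to textstat_frequency() that corresponds to the document names, and this gives you a reference back to your original "row number".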
library("quanteda")
## Package version: 2.1.2
library("data.table")
# Dataframe
Data <- data.frame(
  Column1 = c(1.222, 3.445, 5.621, 8.501, 9.302),
  Column2 = c(654231, 12347, -2365, 90000, 12897),
  Column3 = c("A1", "B2", "E3", "C1", "F5"),
  Column4 = c("I bought it", "The flower has a beautiful fragrance", "It was bought by me", "I have bought it", "The flower smells good"),
  Column5 = c("Good", "Bad", "Ok", "Moderate", "Perfect")
)
# Corpus
Content <- corpus(Data, text_field = "Column4")
docnames(Content) <- seq_len(nrow(Data))
# Tokenization and ngrams
Tokens <- tokens(Content,
  what = "word",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE
) %>%
  tokens_tolower() %>%
  tokens_ngrams(n = 4)
Now for the grouping part:
# form the result
quadgrs <- textstat_frequency(dfm(Tokens), groups = docnames(Tokens)) %>%
  as.data.table()
setnames(quadgrs, "group", "rownumber")
quadgrs[, c("feature", "frequency", "rownumber")]
##                      feature frequency rownumber
## 1:          the_flower_has_a         1         2
## 2:    flower_has_a_beautiful         1         2
## 3: has_a_beautiful_fragrance         1         2
## 4:          it_was_bought_by         1         3
## 5:          was_bought_by_me         1         3
## 6:          i_have_bought_it         1         4
## 7:    the_flower_smells_good         1         5
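Notes:
- I simplified your code a bit, since some of it was unnecessary or could be streamlined.
- The frequency count is now within each row (document), so if the same n-gram occurs in several rows, it appears more than once in the output table, each time with its within-row frequency. If you would rather have the overall frequency repeated for n-grams that occur in multiple rows, this code can easily be modified to do that. (Let me know if you want that.)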
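To illustrate that last point, here is a minimal sketch in base R only. It is not the quanteda pipeline above: it splits on whitespace instead of using quanteda's tokenizer, and the helper `make_ngrams()` and the column name `total_frequency` are hypothetical names chosen for this example. It builds bigrams rather than four-grams, because in this sample data no four-gram occurs in more than one row, whereas the bigrams "the_flower" and "bought_it" do.

```r
# Same example sentences as Column4 above
texts <- c("I bought it", "The flower has a beautiful fragrance",
           "It was bought by me", "I have bought it", "The flower smells good")

# Hypothetical helper (not part of quanteda): n-grams for a single string,
# lowercased, split on whitespace, and joined with "_"
make_ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# One record per n-gram occurrence, tagged with the row it came from
ngram_rows <- do.call(rbind, lapply(seq_along(texts), function(i) {
  grams <- make_ngrams(texts[i], n = 2)
  data.frame(feature = grams,
             rownumber = rep(i, length(grams)),
             stringsAsFactors = FALSE)
}))

# Frequency within each row (document)
per_row <- aggregate(frequency ~ feature + rownumber,
                     data = transform(ngram_rows, frequency = 1L),
                     FUN = sum)

# Overall frequency across all rows, repeated on every row of that n-gram
per_row$total_frequency <- ave(per_row$frequency, per_row$feature, FUN = sum)

print(per_row)
```

Here "the_flower" (rows 2 and 5) and "bought_it" (rows 1 and 4) each get a `total_frequency` of 2, while their within-row `frequency` stays 1. Starting from the `quadgrs` data.table built above, the same idea should come down to a single grouped assignment such as `quadgrs[, total_frequency := sum(frequency), by = feature]`.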