In R, combining individual word counts and dictionary word counts
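I need to count words in a document. In some cases I need the count of a specific word (e.g. "fresh"), and in others I need the total count for a group of words ("philadelphia", "aunt").
I know how to do this in two steps (see the code below), but how can I do it in a single step?
The code below counts specific words.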
library("quanteda")
txt <- "In west Philadelphia born and raised On the playground was where I spent most of my days Chillin' out maxin' relaxin' all cool And all shootin some b-ball outside of the school When a couple of guys who were up to no good Started making trouble in my neighborhood I got in one little fight and my mom got scared."
tokens(txt) %>% tokens_select(c("trouble", "fight")) %>% dfm()
The output is:
trouble, fight
1, 1
The code below counts dictionary words and writes the total count into a single column.
mydict <- dictionary(list(all_terms = c("chillin", "relaxin", "shootin")))
count <- dfm(txt, dictionary = mydict)
The output is:
all_terms
3
How can I combine the two?
I would like something like this (the code is hypothetical and does not work):
tokens(txt) %>% tokens_select(c("trouble", "fight"), mydict) %>% dfm()
or
tokens(txt) %>% tokens_select(c("trouble", "fight"), all_terms=c("chillin","relaxin","shootin")) %>% dfm()
Desired output:
trouble, fight, all_terms
1, 1, 3
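Does brevity matter, i.e. does everything have to fit on one line? If not, one solution is to extract the data from the dfm objects and then combine it into whatever form you want: a matrix, data.frame, or tibble.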
library("quanteda")
library(magrittr) # for the pipe
txt <- "In west Philadelphia born and raised On the playground was where I spent most of my days Chillin' out maxin' relaxin' all cool And all shootin some b-ball outside of the school When a couple of guys who were up to no good Started making trouble in my neighborhood I got in one little fight and my mom got scared."
mydict <- dictionary(list(all_terms = c("chillin", "relaxin", "shootin")))
first <- dfm(tokens_select(tokens(txt), c("trouble", "fight")))
second <- dfm(txt,dictionary = mydict)
# These are the outputs you're after
first@Dimnames$features
first@x
second@Dimnames$features
second@x
# Combine into a matrix
matrix(c(first@Dimnames$features, second@Dimnames$features), ncol = 3) %>%
rbind(c(first@x, second@x))
# Or make two vectors for use elsewhere
paste(c(first@Dimnames$features, second@Dimnames$features), collapse = ", ")
paste(c(first@x, second@x), collapse = ", ")
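There are several ways to do this, and this is probably the simplest. Define a dictionary in which each specific word is its own key, with the word itself as the value, plus one group key for the set of words ("all_terms" in your example).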
library("quanteda")
## Package version: 2.1.2
txt <- "In west Philadelphia born and raised On the playground was where I spent most of my days Chillin' out maxin' relaxin' all cool And all shootin some b-ball outside of the school When a couple of guys who were up to no good Started making trouble in my neighborhood I got in one little fight and my mom got scared."
dict <- dictionary(list(
trouble = "trouble",
fight = "fight",
all_terms = c("chillin", "relaxin", "shootin")
))
Now when you compile the dfm, you get what you want.
dfmat <- dfm(txt, dictionary = dict)
dfmat
## Document-feature matrix of: 1 document, 3 features (0.0% sparse).
##        features
## docs    trouble fight all_terms
##   text1       1     1         3
To coerce this into a simpler object, including the output you listed, you can do the following:
# as a named numeric vector
structure(as.vector(dfmat), names = featnames(dfmat))
##   trouble     fight all_terms
##         1         1         3
# per your output
cat(
paste(featnames(dfmat), collapse = ", "), "\n",
paste(as.vector(dfmat), collapse = ", ")
)
## trouble, fight, all_terms
## 1, 1, 3
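Note that accessing an object's internals directly (as in the other answer) is not a good idea. Use extractor functions such as featnames() instead.
Added:
Another approach that does not require creating a named list of items: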
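A caveat on versions, as far as I know: the dfm(txt, dictionary = dict) call above was run with quanteda 2.1.2. From quanteda 3 onward, calling dfm() directly on a character vector and passing dictionary = were deprecated, and I believe both were removed in version 4. A minimal sketch of the same idea for newer versions, assuming the same text and dictionary, tokenizes first and then applies the dictionary with tokens_lookup():

```r
library("quanteda")

txt <- "In west Philadelphia born and raised On the playground was where I spent most of my days Chillin' out maxin' relaxin' all cool And all shootin some b-ball outside of the school When a couple of guys who were up to no good Started making trouble in my neighborhood I got in one little fight and my mom got scared."

# Same dictionary as above: one key per single word, plus one group key
dict <- dictionary(list(
  trouble = "trouble",
  fight = "fight",
  all_terms = c("chillin", "relaxin", "shootin")
))

# Tokenize explicitly, map tokens to dictionary keys, then build the dfm
toks <- tokens(txt)
dfmat <- dfm(tokens_lookup(toks, dictionary = dict))
dfmat
```

The exact all_terms count can depend on how your quanteda version tokenizes the trailing apostrophes in "Chillin'" and "relaxin'", so compare that column against your own output rather than taking the 2.1.2 result for granted.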
dict <- dictionary(list(all_terms = c("chillin", "relaxin", "shootin")))
single_words <- c("trouble", "fight")
tokens(txt) %>%
tokens_lookup(dictionary = dict, exclusive = FALSE) %>%
tokens_keep(pattern = c(names(dict), single_words)) %>%
dfm()
## Document-feature matrix of: 1 document, 3 features (0.0% sparse).
##        features
## docs    all_terms trouble fight
##   text1         3       1     1
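To make the logic of that lookup-then-keep pipeline explicit, here is a rough base R sketch of the same three steps with no quanteda involved. The tokenizer (lowercase, then split on non-letters) is my own simplification and not what quanteda does internally, and the variable names are only for illustration:

```r
txt <- "In west Philadelphia born and raised On the playground was where I spent most of my days Chillin' out maxin' relaxin' all cool And all shootin some b-ball outside of the school When a couple of guys who were up to no good Started making trouble in my neighborhood I got in one little fight and my mom got scared."

# Crude tokenizer (an assumption): lowercase and split on runs of non-letters
toks <- strsplit(tolower(txt), "[^[:alpha:]]+")[[1]]
toks <- toks[nzchar(toks)]

group_words <- c("chillin", "relaxin", "shootin")
single_words <- c("trouble", "fight")

# Step 1: replace group members with the group key, leave other tokens alone
# (this mirrors tokens_lookup(..., exclusive = FALSE))
toks[toks %in% group_words] <- "all_terms"

# Step 2: keep only the group key and the single words (mirrors tokens_keep())
kept <- toks[toks %in% c("all_terms", single_words)]

# Step 3: count occurrences in a fixed column order (mirrors dfm())
counts <- table(factor(kept, levels = c(single_words, "all_terms")))
counts
# trouble = 1, fight = 1, all_terms = 3
```

This is only meant to show why exclusive = FALSE matters: with an exclusive lookup the unmatched tokens, including "trouble" and "fight", would be dropped before the keep step could select them.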
This is what I suggested in the comments.
> library("quanteda")
> txt <- "In west Philadelphia born and raised On the playground was where I spent most of my days Chillin' out maxin' relaxin' all cool And all shootin some b-ball outside of the school When a couple of guys who were up to no good Started making trouble in my neighborhood I got in one little fight and my mom got scared."
> dict <- dictionary(list(all_terms = c("chillin", "relaxin", "shootin")))
> dfmt <- dfm(txt)
> dfmt_dict <- dfm_lookup(dfmt, dict, exclusive = FALSE, cap = FALSE)
> topfeatures(dfmt_dict)
       in       and        of        my all_terms         '       the         i
        3         3         3         3         3         3         2         2
      all       got
        2         2