In R, how to find the locations of all dictionary words in a data frame?
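I'm analyzing company meetings, and I want to measure at what point in a meeting people bring up certain topics. By "point" I mean the position of the word in the text.
For example, across these three meetings, when do people mention "unionizing" and the other words in my dictionary?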
df <- data.frame(text = c("we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.", "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"))
dict <- c("unions", "strike", "unionizing")
Desired output:
text | count | word |
---|---|---|
we're meeting here today... | (location of word) | unionizing |
hi all, unionizing an... | (location of word) | unionizing |
hi all, unionizing an... | (location of word) | strike |
hi all, unionizing an... | (location of word) | unionizing |
we will discuss unionizing tomorrow... | (location of word) | unionizing |
I previously asked a question about finding the first use of a word, and I tried to modify that code, but without success.
library(tidyverse)
library(tidytext)
df <- tibble(text = c("we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.", "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"))
dict <- tibble(words = c("unions", "strike", "unionizing"))
df %>%
  unnest_tokens(output = "words",
                input = "text",
                drop = FALSE) %>%
  group_by(text) %>%
  mutate(word_count = row_number()) %>%
  ungroup() %>%
  inner_join(dict)
#> Joining, by = "words"
#> # A tibble: 5 × 3
#> text words word_count
#> <chr> <chr> <int>
#> 1 we're meeting here today to talk about our earnings. we will… unio… 14
#> 2 hi all, unionizing and the on-going strike is at the top of … unio… 3
#> 3 hi all, unionizing and the on-going strike is at the top of … stri… 8
#> 4 hi all, unionizing and the on-going strike is at the top of … unio… 17
#> 5 we will discuss unionizing tomorrow, today the focus is our … unio… 4
Created on 2022-05-30 by the reprex package (v2.0.1)
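Note that the position reported for "strike" here (8) differs from the position a whitespace split gives (7), and the output above is consistent with "on-going" being counted as two tokens. The following base R sketch illustrates the effect on the second text. The character-class split is only an approximation of a word tokenizer, not the actual tokenizer used by `unnest_tokens`, and the variable names are illustrative.

```r
text <- "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals."

# Split on whitespace only: "on-going" stays a single token.
ws_tokens <- strsplit(text, "\\s+")[[1]]

# Split on anything that is not a letter, digit, or apostrophe:
# "on-going" becomes "on" and "going" (an approximation of word tokenization).
word_tokens <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
word_tokens <- word_tokens[nzchar(word_tokens)]

which(ws_tokens == "strike")    # 7
which(word_tokens == "strike")  # 8
```

Whichever convention you pick, use the same one for every text so that positions are comparable across meetings.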
Base R solution:
As a single record per observation:
# Create a regular expression to search with:
# search_regex => character scalar
search_regex <- paste0(
dict,
collapse = "|"
)
# For each observation, loop through and then flatten result into a
# data.frame: res => data.frame
res <- do.call(
rbind,
lapply(
df$text,
function(x){
# Create an ordered vector of the words in observation:
# vec_of_words => character vector
vec_of_words <- unlist(
strsplit(
x,
"\\s+"
)
)
# Compute the indices where any of the search terms are found in the vector:
# idx => integer vector
idx <- which(
grepl(
search_regex,
vec_of_words,
ignore.case = TRUE
)
)
# Create a data.frame containing the desired result:
# data.frame => env
data.frame(
# Assign the observation to the text vector:
# text => character vector
text = x,
# Create a string containing the index of matching words:
# count => character vector
count = paste0(
idx,
collapse = ", "
),
# Create a vector of matched words: words => character vector
words = paste0(
vec_of_words[idx],
collapse = ", "
),
row.names = NULL,
stringsAsFactors = FALSE
)
}
)
)
With a new record for each matched word:
# Create a regular expression to search with:
# search_regex => character scalar
search_regex <- paste0(
dict,
collapse = "|"
)
# For each observation, loop through and then flatten result into a
# data.frame: res => data.frame
res <- do.call(
rbind,
lapply(
df$text,
function(x){
# Create an ordered vector of the words in observation:
# vec_of_words => character vector
vec_of_words <- unlist(
strsplit(
x,
"\\s+"
)
)
# Compute the indices where any of the search terms are found in the vector:
# idx => integer vector
idx <- which(
grepl(
search_regex,
vec_of_words,
ignore.case = TRUE
)
)
# Create a data.frame containing the desired result:
# data.frame => env
data.frame(
# Assign the observation to the text vector:
# text => character vector
text = x,
# Store the index of each matching word (one row per match):
# count => integer vector
count = idx,
# Create a vector of matched words: words => character vector
words = vec_of_words[idx],
row.names = NULL,
stringsAsFactors = FALSE
)
}
)
)
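Because `search_regex` has no word boundaries, `grepl` also flags tokens that merely contain a dictionary word, and any punctuation attached to a token is carried into the `words` column. A minimal sketch of an exact-match variant follows, assuming it is acceptable to strip leading and trailing punctuation and to lowercase tokens before comparing. The helper name `find_dict_words` is hypothetical and not part of the answer above.

```r
df <- data.frame(
  text = c(
    "we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.",
    "hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.",
    "we will discuss unionizing tomorrow, today the focus is our Q3 earnings"
  ),
  stringsAsFactors = FALSE
)
dict <- c("unions", "strike", "unionizing")

# Hypothetical helper: one row per exact dictionary match in a single text.
find_dict_words <- function(text, dict) {
  # Split on whitespace, then strip surrounding punctuation and lowercase.
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens))
  # Keep only tokens that are exactly equal to a dictionary word.
  idx <- which(tokens %in% tolower(dict))
  data.frame(
    text = rep(text, length(idx)),
    count = idx,
    words = tokens[idx],
    stringsAsFactors = FALSE
  )
}

res <- do.call(rbind, lapply(df$text, find_dict_words, dict = dict))
res[, c("count", "words")]
```

On the example data this returns positions 14, 3, 7, 16 and 4, since whitespace splitting keeps "on-going" as one token.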
In base R, we can use the following 5 lines of code:
pat <- sprintf("\\b(%s)\\b", paste(dict, collapse = '|'))
words <- regmatches(df$text, gregexpr(pat, df$text))
loc <- Map(pmatch, words, strsplit(df$text, " "))
df1 <- stack(setNames(words, seq_along(words)))
transform(df1, location = unlist(loc), text = df$text[ind])
values ind location text
1 unionizing 1 14 we're meeting here today to talk about our earnings. we will also discuss unionizing efforts.
2 unionizing 2 3 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
3 strike 2 7 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
4 unionizing 2 16 hi all, unionizing and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.
5 unionizing 3 4 we will discuss unionizing tomorrow, today the focus is our Q3 earnings
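The `\\b` word boundaries in `pat` matter: without them, the pattern also matches dictionary words embedded in longer words. A small sketch on a made-up sentence (the sentence and the names `loose` and `strict` are illustrative, not from the original data):

```r
dict <- c("unions", "strike", "unionizing")
x <- "the strikers met the unions before the strike"

# Pattern without word boundaries vs. the bounded pattern used above.
loose  <- paste(dict, collapse = "|")
strict <- sprintf("\\b(%s)\\b", loose)

# The loose pattern picks up "strike" inside "strikers".
loose_hits  <- regmatches(x, gregexpr(loose, x))[[1]]
# The bounded pattern only returns whole-word matches.
strict_hits <- regmatches(x, gregexpr(strict, x))[[1]]

loose_hits   # "strike" "unions" "strike"
strict_hits  # "unions" "strike"
```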
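Using quanteda:
Tokenize first and remove punctuation, otherwise punctuation marks are counted as tokens. The advantage of using kwic is that you can easily see which words appear before and after the word you are looking for.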
library(quanteda)
x <- kwic(tokens(df$text, remove_punct = TRUE), dict)
data.frame(x)
docname from to pre keyword post pattern
1 text1 14 14 earnings we will also discuss unionizing efforts unionizing
2 text2 3 3 hi all unionizing and the on-going strike is unionizing
3 text2 7 7 all unionizing and the on-going strike is at the top of strike
4 text2 16 16 top of our agenda because unionizing threatens our revenue goals unionizing
5 text3 4 4 we will discuss unionizing tomorrow today the focus is unionizing