R - How to count all words in a df row and add the output to a new column? Ideally with tidyverse or tidytext
I am trying to find the position of a word within a text, as well as the total word count of that same text.
library(tidyverse)
library(tidytext)

txt <- tibble(text = c("we're meeting here today to talk about our earnings. we will also discuss global_warming.", "hi all, global_warming and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss global_warming tomorrow, today the focus is our Q3 earnings"))
dict <- tibble(words = c("global_warming"))

# Tokenize each text, number the tokens within each text, then keep only dictionary words
x <- txt %>%
  unnest_tokens(output = "words", input = "text", drop = FALSE) %>%
  group_by(text) %>%
  mutate(word_loc = row_number()) %>%
  ungroup() %>%
  inner_join(dict, by = "words")
This gives me the following output:
# A tibble: 3 x 3
text words word_loc
<chr> <chr> <int>
1 we're meeting here today to talk about our earnings. we will also discuss global_warming. global_warm… 14
2 hi all, global_warming and the on-going strike is at the top of our agenda, because unioni… global_warm… 3
3 we will discuss global_warming tomorrow, today the focus is our Q3 earnings global_warm… 4
How can I add a column that tells me the total word count for each row?
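We can use str_count to get the total number of words in each string, where \S+ matches every run of non-whitespace characters (inside an R string literal the backslash must be doubled).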
library(tidyverse)
x %>%
  mutate(count = str_count(text, "\\S+"))
Or another option using base R:
x$count <- lengths(gregexpr("\\S+", x$text))
Output
text words word_loc count
<chr> <chr> <int> <int>
1 we're meeting here today to talk about our ea… glob… 14 14
2 hi all, global_warming and the on-going strik… glob… 3 20
3 we will discuss global_warming tomorrow, toda… glob… 4 12
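One caveat with the base R one-liner: gregexpr returns -1 when nothing matches, so lengths() reports 1 for an empty string. The following is a minimal sketch of a workaround using regmatches; the helper name count_words and the sample vector are illustrative and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): count runs of
# non-whitespace characters, giving 0 for strings that contain no token.
count_words <- function(text) {
  matches <- regmatches(text, gregexpr("\\S+", text))
  lengths(matches)
}

sample_text <- c(
  "we will discuss global_warming tomorrow, today the focus is our Q3 earnings",
  ""
)

safe_counts <- count_words(sample_text)                   # 12 0
naive_counts <- lengths(gregexpr("\\S+", sample_text))    # 12 1 (empty string counted as 1)

print(safe_counts)
print(naive_counts)
```

For the three example texts the two versions agree, so the difference only matters if the column can contain empty strings.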
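Alternatively, if contractions and hyphenated words should be split into their parts (so that "we're" counts as two words), use \w+ instead: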
x %>%
  mutate(count = str_count(text, "\\w+"))
text words word_loc count
<chr> <chr> <int> <int>
1 we're meeting here today to talk about our ea… glob… 14 15
2 hi all, global_warming and the on-going strik… glob… 3 21
3 we will discuss global_warming tomorrow, toda… glob… 4 12
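To see why the two patterns give different totals, it can help to look at the matches themselves rather than only their count. This is a small base R sketch; the phrase is made up from tokens that appear in the example texts, and the variable names are illustrative.

```r
# Illustrative phrase built from tokens in the example texts
phrase <- "we're on-going global_warming"

# \S+ keeps every whitespace-delimited chunk intact
non_space <- regmatches(phrase, gregexpr("\\S+", phrase))[[1]]
# "we're" "on-going" "global_warming"

# \w+ matches runs of letters, digits and underscores, so the apostrophe
# and the hyphen act as separators while the underscore does not
word_runs <- regmatches(phrase, gregexpr("\\w+", phrase))[[1]]
# "we" "re" "on" "going" "global_warming"

print(non_space)
print(word_runs)
```

This accounts for the extra word in rows 1 ("we're") and 2 ("on-going") of the \w+ output, while row 3 is unchanged.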