R - how to count all words in a df row and add output to a new column? Ideally with tidyverse or tidytext

I am trying to find the position of a word within a text, as well as the total word count of that same text.

library(tidyverse)
library(tidytext)

# Three example texts and a one-word dictionary to look up
txt <- tibble(text = c("we're meeting here today to talk about our earnings. we will also discuss global_warming.", "hi all, global_warming and the on-going strike is at the top of our agenda, because unionizing threatens our revenue goals.", "we will discuss global_warming tomorrow, today the focus is our Q3 earnings"))
dict <- tibble(words = c("global_warming"))

# One row per token, numbered within each text, then keep only dictionary words
x <- txt %>%
  unnest_tokens(output = "words",
                input = "text",
                drop = FALSE) %>%
  group_by(text) %>%
  mutate(word_loc = row_number()) %>%
  ungroup() %>%
  inner_join(dict, by = "words")

This gives me the following output:

# A tibble: 3 x 3
  text                                                                                        words        word_loc
  <chr>                                                                                       <chr>           <int>
1 we're meeting here today to talk about our earnings. we will also discuss global_warming.   global_warm…       14
2 hi all, global_warming and the on-going strike is at the top of our agenda, because unioni… global_warm…        3
3 we will discuss global_warming tomorrow, today the focus is our Q3 earnings                 global_warm…        4

How can I add a column that tells me the total word count for each row?

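We can use str_count to get the total number of words in each string. The pattern \S+ (written "\\S+" inside an R string) matches every run of non-whitespace characters, so each whitespace-separated chunk counts as one word.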

library(tidyverse)

x %>%
  mutate(count = str_count(text, "\\S+"))

Or another option using base R:

x$count <- lengths(gregexpr("\\S+", x$text))

Output

  text                                           words word_loc count
  <chr>                                          <chr>    <int> <int>
1 we're meeting here today to talk about our ea… glob…       14    14
2 hi all, global_warming and the on-going strik… glob…        3    20
3 we will discuss global_warming tomorrow, toda… glob…        4    12
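
One caveat with the base R version: gregexpr returns -1 when nothing matches, so lengths() reports 1 for an empty string rather than 0. The sketch below uses a made-up vector `s` (not part of the original data) and assumes that wrapping the matches in regmatches() is an acceptable workaround:

```r
# Hypothetical input: one ordinary sentence and one empty string
s <- c("we will discuss global_warming tomorrow", "")

# gregexpr() signals "no match" with -1, so lengths() counts it as 1
naive <- lengths(gregexpr("\\S+", s))

# regmatches() drops the -1 placeholder, so "" yields zero matches
safe <- lengths(regmatches(s, gregexpr("\\S+", s)))

print(naive)  # 5 1
print(safe)   # 5 0
```

For the three texts in the question none of the strings is empty, so both variants return the same counts.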

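Or, if you want the parts of contractions and hyphenated words to be counted separately (so that "we're" and "on-going" each count as two words), you can use \w+ instead: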

x %>%
  mutate(count = str_count(text, "\\w+"))

  text                                           words word_loc count
  <chr>                                          <chr>    <int> <int>
1 we're meeting here today to talk about our ea… glob…       14    15
2 hi all, global_warming and the on-going strik… glob…        3    21
3 we will discuss global_warming tomorrow, toda… glob…        4    12
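
To see why the two patterns give different counts, it can help to look at the matched pieces themselves. This is a small base R sketch on a made-up string `s` (an assumption for illustration, built from words that appear in the example texts):

```r
# Hypothetical test string containing a contraction, a hyphenated word,
# and a word joined by an underscore
s <- "we're on-going global_warming"

# \S+ : runs of non-whitespace, so punctuation stays attached to the word
by_space <- regmatches(s, gregexpr("\\S+", s))[[1]]

# \w+ : runs of letters, digits and underscores, so ' and - split words
by_word <- regmatches(s, gregexpr("\\w+", s))[[1]]

print(by_space)  # "we're" "on-going" "global_warming"
print(by_word)   # "we" "re" "on" "going" "global_warming"
```

The underscore counts as a word character, which is why global_warming stays a single word under both patterns.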