Count number of utterances by same speakers but discount number of in-between pauses

Question

In this data of Speakers and their Utterances:

df <- data.frame(
  Line = 1:15,
  Speaker = c("ID01.A", NA, "ID01.B",                           
              "ID17.A", NA,                                     
              "ID27.B", NA, "ID27.B", NA, "ID27.B", "ID27.A",   
              "ID27.C",                                         
              "ID33.B", "ID33.A", "ID33.C"),                  
  
  Utterance = c("Who did it?", NA, "Peter did.",                                   
                "Hello!", "(1.11)",                                                 
                "Did you", "(1.2)", "erm", "(0.9)", "go [there]?", "[heck] yeah",   
                "wow!",                                                             
                "[When] you're coming?", "[that's]", "Yes, sure."),                 
  Sequ = c(1,1,1,
           NA, NA, 
           2,2,2,2,2,2,
           NA,
           3,3,3),
  Q = c("q_wh", "", "", 
        NA, NA, 
        "q_pol", "", "", "", "", "",
        NA,
        "q_wh", "", ""))

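I need to count the number of Utterances per Speaker before the Speaker changes, without counting the in-between pauses (in round brackets, e.g. (...)) and the NA values that sit between each Speaker's consecutive series of Utterances. I can count the number of Utterances by Speaker and Sequ, but that count includes all the in-between pauses and NAs: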

library(dplyr)
library(tidyr)
df %>%
  fill(Speaker, .direction = 'down') %>%
  group_by(Speaker, Sequ) %>%
  mutate(N_ipu = n())
# A tibble: 15 × 6
# Groups:   Speaker, Sequ [9]
    Line Speaker Utterance              Sequ Q       N_ipu
   <int> <chr>   <chr>                 <dbl> <chr>   <int>
 1     1 ID01.A  Who did it?               1 "q_wh"      2
 2     2 ID01.A  NA                        1 ""          2
 3     3 ID01.B  Peter did.                1 ""          1
 4     4 ID17.A  Hello!                   NA  NA         2
 5     5 ID17.A  (1.11)                   NA  NA         2
 6     6 ID27.B  Did you                   2 "q_pol"     5
 7     7 ID27.B  (1.2)                     2 ""          5
 8     8 ID27.B  erm                       2 ""          5
 9     9 ID27.B  (0.9)                     2 ""          5
10    10 ID27.B  go [there]?               2 ""          5
11    11 ID27.A  [heck] yeah               2 ""          1
12    12 ID27.C  wow!                     NA  NA         1
13    13 ID33.B  [When] you're coming?     3 "q_wh"      1
14    14 ID33.A  [that's]                  3 ""          1
15    15 ID33.C  Yes, sure.                3 ""          1

How can I exclude the pauses so that the final result looks like this:

# A tibble: 15 × 6
# Groups:   Speaker, Sequ [9]
    Line Speaker Utterance              Sequ Q       N_ipu
   <int> <chr>   <chr>                 <dbl> <chr>   <int>
 1     1 ID01.A  Who did it?               1 "q_wh"      1
 2     2 ID01.A  NA                        1 ""          1
 3     3 ID01.B  Peter did.                1 ""          1
 4     4 ID17.A  Hello!                   NA  NA         1
 5     5 ID17.A  (1.11)                   NA  NA         1
 6     6 ID27.B  Did you                   2 "q_pol"     3
 7     7 ID27.B  (1.2)                     2 ""          3
 8     8 ID27.B  erm                       2 ""          3
 9     9 ID27.B  (0.9)                     2 ""          3
10    10 ID27.B  go [there]?               2 ""          3
11    11 ID27.A  [heck] yeah               2 ""          1
12    12 ID27.C  wow!                     NA  NA         1
13    13 ID33.B  [When] you're coming?     3 "q_wh"      1
14    14 ID33.A  [that's]                  3 ""          1
15    15 ID33.C  Yes, sure.                3 ""          1

Answer 1

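We can flag every Utterance that contains a string in round brackets, using the regex \(([^\)]+)\) (the backslashes have to be doubled inside an R string). Pauses and NA rows get a 0, every other row gets a 1, and summing these flags within each Speaker and Sequ group gives the count.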

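library(dplyr)
library(tidyr)
library(stringr)  # provides str_detect()
df %>%
  fill(Speaker, .direction = 'down') %>%
  group_by(Speaker, Sequ) %>%
  # helper is 0 for pauses "(...)" and NA rows, 1 for real utterances
  mutate(helper = ifelse(str_detect(Utterance, '\\(([^\\)]+)\\)')
                         | is.na(Utterance), 0, 1)) %>% 
  # sum the flags per group; .keep = "unused" drops the helper column
  mutate(N_ipu = sum(helper), .keep="unused")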

    Line Speaker Utterance              Sequ Q       N_ipu
   <int> <chr>   <chr>                 <dbl> <chr>   <dbl>
 1     1 ID01.A  Who did it?               1 "q_wh"      1
 2     2 ID01.A  NA                        1 ""          1
 3     3 ID01.B  Peter did.                1 ""          1
 4     4 ID17.A  Hello!                   NA  NA         1
 5     5 ID17.A  (1.11)                   NA  NA         1
 6     6 ID27.B  Did you                   2 "q_pol"     3
 7     7 ID27.B  (1.2)                     2 ""          3
 8     8 ID27.B  erm                       2 ""          3
 9     9 ID27.B  (0.9)                     2 ""          3
10    10 ID27.B  go [there]?               2 ""          3
11    11 ID27.A  [heck] yeah               2 ""          1
12    12 ID27.C  wow!                     NA  NA         1
13    13 ID33.B  [When] you're coming?     3 "q_wh"      1
14    14 ID33.A  [that's]                  3 ""          1
15    15 ID33.C  Yes, sure.                3 ""          1
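
One thing to keep in mind: grouping by Speaker and Sequ merges all rows of a speaker within a sequence, even if another speaker talks in between. If the count should restart every time the Speaker changes, as the question describes, a run id can replace the grouping. The following is only a sketch under that assumption. The `toy` data frame, the `run` column, and the anchored pattern "^\\(.*\\)$" (which treats a cell as a pause only if the whole cell is in round brackets) are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-in for a slice of the question's data
toy <- data.frame(
  Speaker   = c("ID01.A", NA, "ID27.B", NA, "ID27.B", NA, "ID27.B", "ID27.A"),
  Utterance = c("Who did it?", NA, "Did you", "(1.2)", "erm", "(0.9)",
                "go [there]?", "[heck] yeah")
)

res <- toy %>%
  fill(Speaker, .direction = "down") %>%
  # run id: increases by one every time the Speaker differs from the previous row
  mutate(run = cumsum(Speaker != lag(Speaker, default = first(Speaker)))) %>%
  group_by(run) %>%
  # count rows that are neither NA nor a cell consisting only of "(...)"
  mutate(N_ipu = sum(!is.na(Utterance) &
                       !str_detect(Utterance, "^\\(.*\\)$"))) %>%
  ungroup()

res$N_ipu
```

On this toy data `res$N_ipu` is 1 for the two ID01.A rows, 3 for the five ID27.B rows, and 1 for the ID27.A row. On the question's `df` both approaches give the same counts, because no speaker returns after an interruption within the same sequence.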
