Count number of utterances by same speaker but discount in-between pauses

I have data on Speakers and their Utterances:

df <- data.frame(
  Line = 1:15,
  Speaker = c("ID01.A", NA, "ID01.B",                           
              "ID17.A", NA,                                     
              "ID27.B", NA, "ID27.B", NA, "ID27.B", "ID27.A",   
              "ID27.C",                                         
              "ID33.B", "ID33.A", "ID33.C"),                  
  
  Utterance = c("Who did it?", NA, "Peter did.",                                   
                "Hello!", "(1.11)",                                                 
                "Did you", "(1.2)", "erm", "(0.9)", "go [there]?", "[heck] yeah",   
                "wow!",                                                             
                "[When] you're coming?", "[that's]", "Yes, sure."),                 
  Sequ = c(1,1,1,
           NA, NA, 
           2,2,2,2,2,2,
           NA,
           3,3,3),
  Q = c("q_wh", "", "", 
        NA, NA, 
        "q_pol", "", "", "", "", "",
        NA,
        "q_wh", "", ""))

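I need to count the number of Utterances per speaker before the Speaker changes, without counting the in-between pauses (in round brackets, such as (...)) and the NA values that sit between a Speaker's successive Utterances. I can count the number of Utterances by Speaker and Sequ, but the counts include all in-between pauses and NAs: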

library(dplyr)
library(tidyr)
df %>%
  fill(Speaker, .direction = 'down') %>%
  group_by(Speaker, Sequ) %>%
  mutate(N_ipu = n())
# A tibble: 15 × 6
# Groups:   Speaker, Sequ [9]
    Line Speaker Utterance              Sequ Q       N_ipu
   <int> <chr>   <chr>                 <dbl> <chr>   <int>
 1     1 ID01.A  Who did it?               1 "q_wh"      2
 2     2 ID01.A  NA                        1 ""          2
 3     3 ID01.B  Peter did.                1 ""          1
 4     4 ID17.A  Hello!                   NA  NA         2
 5     5 ID17.A  (1.11)                   NA  NA         2
 6     6 ID27.B  Did you                   2 "q_pol"     5
 7     7 ID27.B  (1.2)                     2 ""          5
 8     8 ID27.B  erm                       2 ""          5
 9     9 ID27.B  (0.9)                     2 ""          5
10    10 ID27.B  go [there]?               2 ""          5
11    11 ID27.A  [heck] yeah               2 ""          1
12    12 ID27.C  wow!                     NA  NA         1
13    13 ID33.B  [When] you're coming?     3 "q_wh"      1
14    14 ID33.A  [that's]                  3 ""          1
15    15 ID33.C  Yes, sure.                3 ""          1

How can the pauses be excluded so that the end result looks like this:

# A tibble: 15 × 6
# Groups:   Speaker, Sequ [9]
    Line Speaker Utterance              Sequ Q       N_ipu
   <int> <chr>   <chr>                 <dbl> <chr>   <int>
 1     1 ID01.A  Who did it?               1 "q_wh"      1
 2     2 ID01.A  NA                        1 ""          1
 3     3 ID01.B  Peter did.                1 ""          1
 4     4 ID17.A  Hello!                   NA  NA         1
 5     5 ID17.A  (1.11)                   NA  NA         1
 6     6 ID27.B  Did you                   2 "q_pol"     3
 7     7 ID27.B  (1.2)                     2 ""          3
 8     8 ID27.B  erm                       2 ""          3
 9     9 ID27.B  (0.9)                     2 ""          3
10    10 ID27.B  go [there]?               2 ""          3
11    11 ID27.A  [heck] yeah               2 ""          1
12    12 ID27.C  wow!                     NA  NA         1
13    13 ID33.B  [When] you're coming?     3 "q_wh"      1
14    14 ID33.A  [that's]                  3 ""          1
15    15 ID33.C  Yes, sure.                3 ""          1

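We can flag every string in round brackets in the Utterance column with the regular expression \(([^\)]+)\) (the backslashes must be doubled inside an R string), score those rows and the NA rows as 0 and all other rows as 1, and then sum the scores per group.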

library(dplyr)
library(tidyr)
library(stringr)  # provides str_detect()
df %>%
  fill(Speaker, .direction = 'down') %>%
  group_by(Speaker, Sequ) %>%
  # helper is 0 for pauses such as "(1.2)" and for NA, 1 for real utterances
  mutate(helper = ifelse(str_detect(Utterance, '\\(([^\\)]+)\\)')
                         | is.na(Utterance), 0, 1)) %>% 
  mutate(N_ipu = sum(helper), .keep="unused")
    Line Speaker Utterance              Sequ Q       N_ipu
   <int> <chr>   <chr>                 <dbl> <chr>   <dbl>
 1     1 ID01.A  Who did it?               1 "q_wh"      1
 2     2 ID01.A  NA                        1 ""          1
 3     3 ID01.B  Peter did.                1 ""          1
 4     4 ID17.A  Hello!                   NA  NA         1
 5     5 ID17.A  (1.11)                   NA  NA         1
 6     6 ID27.B  Did you                   2 "q_pol"     3
 7     7 ID27.B  (1.2)                     2 ""          3
 8     8 ID27.B  erm                       2 ""          3
 9     9 ID27.B  (0.9)                     2 ""          3
10    10 ID27.B  go [there]?               2 ""          3
11    11 ID27.A  [heck] yeah               2 ""          1
12    12 ID27.C  wow!                     NA  NA         1
13    13 ID33.B  [When] you're coming?     3 "q_wh"      1
14    14 ID33.A  [that's]                  3 ""          1
15    15 ID33.C  Yes, sure.                3 ""          1
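
One caveat that may or may not matter for your data: grouping by Speaker and Sequ pools all rows of a speaker within a sequence, even if another speaker talks in between. If the count should restart at every speaker change, a run-based grouping is one option. The following is only a sketch under assumptions: the `toy` data frame and the `turn` column are hypothetical and not part of the original data, and the anchored pattern `^\\(.*\\)$` is my own choice to match rows that consist solely of a bracketed pause.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: X talks, Y interjects, X talks again,
# all within the same sequence.
toy <- data.frame(
  Speaker   = c("X", NA, "X", "Y", "X"),
  Utterance = c("so", "(0.5)", "I think", "yeah", "we should"),
  Sequ      = 1
)

res <- toy %>%
  fill(Speaker, .direction = "down") %>%
  # A new turn starts whenever the speaker differs from the previous row.
  mutate(turn = cumsum(Speaker != lag(Speaker, default = first(Speaker)))) %>%
  group_by(Sequ, turn) %>%
  # Rows that are NA or consist only of a bracketed pause are not counted.
  mutate(N_ipu = sum(!is.na(Utterance) & !grepl("^\\(.*\\)$", Utterance))) %>%
  ungroup()

res$N_ipu
```

Here the first three rows form one turn by X with two real utterances, and X's final row starts a fresh turn, whereas `group_by(Speaker, Sequ)` would pool all of X's rows into a single count of 3.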