Signalling where a new row should start based on arbitrary characters when converting webscraped output to a tibble

Question

I am scraping a television script and then trying to clean it up. This is what I have so far:

library(tidyverse)
library(rvest)

s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm')

s1_e1 <- s1_e1 %>%
  html_nodes("p") %>%
  html_text() 

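# Remove stage directions in parentheses, then scene headings in square brackets
# (backslashes must be doubled inside an R string literal)
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\([^\\)]+\\)", replacement = "")
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\[[^\\]]+\\]", replacement = "")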
s1_e1 <- str_squish(s1_e1)

s1_e1 <- s1_e1 %>% 
  as_tibble() %>% 
  filter(value!="") %>% 
  mutate(season = "27",
         episode_num = "1",
         airdate_orig = str_sub(.$value[1], -12),
         episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1])-1)) %>% 
  slice(-1)

This gives me the following:

# A tibble: 38 x 5
   value                                    season episode_num airdate_orig episode_name
   <chr>                                    <chr>  <chr>       <chr>        <chr>       
 1 ROSE: Bye! JACKIE: See you later!        27     1           26 Mar, 2005 Rose        
 2 TANNOY: This is a customer announcement… 27     1           26 Mar, 2005 Rose        
 3 ROSE: You pulled his arm off. DOCTOR: Y… 27     1           26 Mar, 2005 Rose        
 4 ROSE: That's just not funny. That's sic… 27     1           26 Mar, 2005 Rose        
 5 TAXI DRIVER: Watch it!                   27     1           26 Mar, 2005 Rose        
 6 TELEVISION: The whole of Central London… 27     1           26 Mar, 2005 Rose        
 7 JACKIE: There's no point in getting up,… 27     1           26 Mar, 2005 Rose        
 8 JACKIE: There's Finch's. You could try … 27     1           26 Mar, 2005 Rose        
 9 ROSE: It's about last night. He's part … 27     1           26 Mar, 2005 Rose        
10 ROSE: Don't mind the mess. Do you want … 27     1           26 Mar, 2005 Rose        
# … with 28 more rows

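I would like each row to be a single character's line of dialogue. As you can see, the script helpfully puts the speaker's name in capitals, followed by a colon and a space, before each new line of dialogue, e.g. ROSE: or TANNOY: . Is there a way to tell R that I want each row of the tibble to start with this capitalised text followed by a colon, and to continue until another capitalised word followed by a colon appears?

For example, the first row would start with ROSE: Bye!, the second with JACKIE: See you later!, the third with TANNOY: This is a customer announcement… and run until the next capitalised word followed by a colon, and so on.

Also, if anyone has advice on how to integrate the stringr functions into the dplyr chunk, please let me know. I can make a separate post about this if that is best, but I kept running into errors when I tried (although the code above does work).

Many thanks!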

Answer 1

You can split on a lookahead pattern:

library(tidyverse)

s1_e1 %>% 
  # Split at whitespace that is followed by capital letters and a colon
  mutate(value = str_split(value, "\\s(?=[A-Z]+:)")) %>% 
  unnest(value)

returns

# A tibble: 322 x 5
   value                                                             season episode_num airdate_orig episode_name
   <chr>                                                             <chr>  <chr>       <chr>        <chr>       
 1 ROSE: Bye!                                                        27     1           26 Mar, 2005 Rose        
 2 JACKIE: See you later!                                            27     1           26 Mar, 2005 Rose        
 3 TANNOY: This is a customer announcement. The store will be closi~ 27     1           26 Mar, 2005 Rose        
 4 GUARD: Oi!                                                        27     1           26 Mar, 2005 Rose        
 5 ROSE: Wilson? Wilson, I've got the lottery money. Wilson, are yo~ 27     1           26 Mar, 2005 Rose        
 6 ROSE: I can't hang about 'cos they're closing the shop. Wilson! ~ 27     1           26 Mar, 2005 Rose        
 7 ROSE: Hello? Hello, Wilson, it's Rose. Hello? Wilson?             27     1           26 Mar, 2005 Rose        
 8 ROSE: Wilson? Wilson!                                             27     1           26 Mar, 2005 Rose        
 9 ROSE: You're kidding me.                                          27     1           26 Mar, 2005 Rose        
10 ROSE: Is that someone mucking about? Who is it?                   27     1           26 Mar, 2005 Rose
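
To see what the lookahead does in isolation, here is a minimal sketch on made-up strings (the sample lines and the `multi_word` pattern are assumptions for illustration, not part of the original answer). It also shows that the answer's pattern only recognises single-word labels, so a speaker such as TAXI DRIVER: gets split in the wrong place; the variant is one possible way around that, not a tested fix for the whole script.

```r
library(stringr)

# Hypothetical sample lines, written for illustration (not scraped)
simple <- "ROSE: Bye! JACKIE: See you later!"
multi  <- "ROSE: Sorry! TAXI DRIVER: Watch it! ROSE: Bye!"

# The answer's pattern: split on whitespace only when it is followed by
# capital letters and a colon. The lookahead (?=...) checks for the speaker
# label without consuming it, so the label stays at the start of the next piece.
single_word <- "\\s(?=[A-Z]+:)"
res_simple <- str_split(simple, single_word)[[1]]
# "ROSE: Bye!" "JACKIE: See you later!"

# With a two-word label the same pattern splits before "DRIVER:" instead
res_single <- str_split(multi, single_word)[[1]]
# "ROSE: Sorry! TAXI" "DRIVER: Watch it!" "ROSE: Bye!"

# Assumed variant: allow several capitalised words in the label, and use a
# negative lookbehind so that a space inside the label is not a split point.
multi_word <- "(?<![A-Z])\\s(?=[A-Z]+(?: [A-Z]+)*:)"
res_multi <- str_split(multi, multi_word)[[1]]
# "ROSE: Sorry!" "TAXI DRIVER: Watch it!" "ROSE: Bye!"
```

The variant would still miss a label that directly follows a word ending in a capital letter (for example "OK ROSE:"), and neither pattern handles labels containing digits or punctuation, so it is worth checking the result against the actual script.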

Streamlined workflow

You can indeed put all of the steps into a single pipeline:

s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm') %>%
  html_nodes("p") %>%
  html_text() %>% 
  tibble(value = .) %>% 
  # Strip "(...)" and "[...]" annotations, then collapse extra whitespace
  mutate(value = str_squish(str_replace_all(value, "(\\s*\\([^\\)]+\\)|\\s*\\[[^\\]]+\\])", ""))) %>% 
  filter(value!="") %>% 
  mutate(season = "27",
         episode_num = "1",
         airdate_orig = str_sub(.$value[1], -12),
         episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1])-1)) %>% 
  slice(-1) %>% 
  # One list element per speaker turn, expanded into rows by unnest()
  mutate(value = str_split(value, "\\s(?=[A-Z]+:)")) %>% 
  unnest(value)
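
The same idea can be tried without hitting the website. The sketch below runs the cleaning and splitting steps on a small made-up character vector (`raw` is a hypothetical stand-in for the scraped paragraphs). Because the stringr functions are vectorised, they can be applied to a column directly inside `mutate()`, which is probably the simplest way to combine stringr with dplyr here.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical stand-in for the output of html_text()
raw <- c("ROSE: Bye! (She leaves.) JACKIE: See you later!",
         "[Shop] TANNOY: This is a customer announcement.")

script <- tibble(value = raw) %>%
  # Drop "(...)" and "[...]" annotations, then tidy up the whitespace
  mutate(value = str_squish(
    str_replace_all(value, "\\s*\\([^\\)]+\\)|\\s*\\[[^\\]]+\\]", "")
  )) %>%
  # str_split() returns a list column: one character vector per original row
  mutate(value = str_split(value, "\\s(?=[A-Z]+:)")) %>%
  # unnest() expands each list element into its own row
  unnest(value)

script
```

This should yield three rows, one per speaker turn, with the annotations removed.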
