Split long string by a vector of words
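I would like to split some TV scripts into a data frame with two variables: (1) the spoken dialogue and (2) the speaker.
Here is the sample data: http://www.buffyworld.com/buffy/transcripts/127_tran.html
It is loaded into R as follows: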
require(rvest)
url <- 'http://www.buffyworld.com/buffy/transcripts/127_tran.html'
url <- read_html(url)
all <- url %>% html_text()
[1] "Selfless - Buffy Episode 7x5 'Selfless' (#127) Transcript\n\nBuffy Episode #127: \"Selfless\" \n Transcript\nWritten by Drew Goddard\n Original Air Date: October 22, 2002 Skip Teaser.. Take Me To Beginning Of Episode. \n\n \n \n NB: The content of this transcript, including the characters \n and the story, belongs to Mutant Enemy. This transcript was created \n based on the broadcast episode.\n \n \n \n \n BUFFYWORLD.COM \n prefers that you direct link to this transcript rather than post \n it on your site, but you can post it on your site if you really \n want, as long as you keep everything intact, this includes the link \n to buffyworld.com and this writing. Please also keep the disclaimers \n intact.\n \n Originally transcribed for: http://www.buffyworld.com/.\n\t \n TEASER (RECAP SEGMENT):\n GILES (V.O.)\n\n Previousl... <truncated>
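What I am trying now is to split on each character's name (I have a full list), for example 'GILES' above. This works fine, except that when I split there I cannot keep the character name. Here is a simplified example.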
to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
all <- strsplit(all, to_parse)
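This gives me the split I want, but it does not keep the character names.
Limited question: is there a way to keep the character names with what I am doing?
Open-ended question: should I try a different approach?
Thanks in advance!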
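I think you can use Perl-compatible regular expressions in strsplit, with a lookbehind so that the split happens right after a name without consuming it. For the sake of explanation I used a shorter example string, but it should work the same way: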
string <- "text BUFFY more text WILLOW other text"
to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
strsplit(string, paste0("(?<=", to_parse, ")"), perl = TRUE)
#[[1]]
#[1] "text BUFFY" " more text WILLOW" " other text"
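As @Lamia suggested, if the name comes before the text instead, you can use a positive lookahead. I edited the suggestion slightly so that each split piece keeps the delimiter (the name).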
strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)
#[[1]]
#[1] "text " "BUFFY more text " "WILLOW other text"
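To get from those pieces to the two-column data frame asked for in the question, one possible follow-up is sketched below. It assumes every piece of interest starts with one of the known names; the names `speakers`, `pieces`, `name_at_start`, and `lines_df` are made up for this illustration, and any leading text that does not start with a name is simply dropped.

```r
string <- "text BUFFY more text WILLOW other text"
speakers <- c("BUFFY", "WILLOW")            # hypothetical short list of names
to_parse <- paste(speakers, collapse = "|")

# Split just before each name, so every piece starts with its speaker
pieces <- strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)[[1]]

# Keep only the pieces that begin with a known name
name_at_start <- paste0("^(", to_parse, ")")
pieces <- pieces[grepl(name_at_start, pieces)]

# Pull the name off the front; the rest is the dialogue
lines_df <- data.frame(
  speaker  = regmatches(pieces, regexpr(name_at_start, pieces)),
  dialogue = trimws(sub(name_at_start, "", pieces)),
  stringsAsFactors = FALSE
)
lines_df
```

On the real transcript, a name that also appears in capitals inside someone's dialogue would trigger an extra split, so the pattern may need tightening (for example, requiring the name to start a line).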