Converting a dialogue tibble to .txt, and back again
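I want to convert a tibble representing a dialogue into a .txt file that can be edited by hand in a text editor, and then read the edited file back into a tibble for processing.
The main challenge is separating the blocks of text in a way that lets them be re-imported into a similar format after editing, while preserving the "Speaker" designation.
Speed matters, because both the file size and the length of each text segment are large.
Here is the input tibble: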
tibble::tribble(
~word, ~speakerTag,
"been", 1L,
"going", 1L,
"on", 1L,
"and", 1L,
"what", 1L,
"your", 1L,
"goals", 1L,
"are.", 1L,
"Yeah,", 2L,
"so", 2L,
"so", 2L,
"John", 2L,
"has", 2L,
"15", 2L
)
Here is the desired output in the .txt:
###Speaker 1###
been going on and what your goals are.
###Speaker 2###
Yeah, so so John has 15
Here is the desired return after manually correcting errors:
tibble::tribble(
~word, ~speakerTag,
"been", 1L,
"going", 1L,
"on", 1L,
"and", 1L,
"what", 1L,
"your", 1L,
"goals", 1L,
"in", 1L,
"r", 1L,
"Yeah,", 2L,
"so", 2L,
"so", 2L,
"John", 2L,
"hates", 2L,
"50", 2L
)
One option is to insert the speaker name, set off by "\n", at the start of each run of speakerTag:
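library(data.table)
library(dplyr)
library(tidyr)
# df is the input tibble from the question.
# Within each run of consecutive rows from the same speaker, prefix the
# first word with a "Speaker<n>" header surrounded by blank lines.
setDT(df)[, word := replace(word, 1, paste0("\n\nSpeaker",
first(speakerTag), '\n\n', first(word))), rleid(speakerTag)]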
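Grouping by rleid(speakerTag) rather than by speakerTag itself matters once a speaker talks more than once. A small illustration, using a made-up tag vector that is not part of the question's data:

```r
library(data.table)

# Hypothetical tags: speaker 1, then speaker 2, then speaker 1 again.
tags <- c(1L, 1L, 2L, 2L, 1L)

# rleid() starts a new id every time the value changes, so the second
# appearance of speaker 1 becomes its own run (id 3) and gets its own header.
run_id <- rleid(tags)
print(run_id)
```

Grouping by speakerTag alone would instead treat both appearances of speaker 1 as a single group, so only the first one would receive a header.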
We can then write this to a text file with:
writeLines(paste(df$word, collapse = " "), 'Downloads/temp.txt')
which looks like this:
cat(paste(df$word, collapse = " "))
#Speaker1
#
#been going on and what your goals are.
#
#Speaker2
#
#Yeah, so so John has 15
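If the file should instead follow the ###Speaker N### layout shown in the question, one possible sketch is to collapse each run into a single line and interleave header lines with text lines. The object names (blocks, out, path) and the use of a temporary file are assumptions made for illustration.

```r
library(data.table)

# Example input, copied from the question.
df <- data.table(
  word = c("been", "going", "on", "and", "what", "your", "goals", "are.",
           "Yeah,", "so", "so", "John", "has", "15"),
  speakerTag = c(rep(1L, 8), rep(2L, 6))
)

# Collapse each consecutive run of the same speaker into one line of text.
blocks <- df[, .(text = paste(word, collapse = " ")),
             by = .(run = rleid(speakerTag), speakerTag)]

# Interleave headers and text: header 1, text 1, header 2, text 2, ...
out <- c(rbind(sprintf("###Speaker %d###", blocks$speakerTag), blocks$text))

# Hypothetical output location; a temp file keeps the sketch self-contained.
path <- tempfile(fileext = ".txt")
writeLines(out, path)
```

Because each run is pasted together once, the work is a single grouped pass over the table, which should scale reasonably to long segments.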
To read it back into R, we can do:
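# quote = "" and comment.char = "" stop read.table from treating
# apostrophes or "#" inside the dialogue as quotes or comments.
read.table('Downloads/temp.txt', sep = "\t", col.names = 'word',
quote = "", comment.char = "") %>%
# odd rows are speaker headers, even rows are text: keep headers, blank the rest
mutate(SpeakerTag = replace(word, c(FALSE, TRUE), NA)) %>%
fill(SpeakerTag) %>%
slice(seq(2, n(), 2)) %>%
# the backslash must be doubled inside an R string literal
separate_rows(word, sep = "\\s") %>%
filter(word != '')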
# word SpeakerTag
#1 been Speaker1
#2 going Speaker1
#3 on Speaker1
#4 and Speaker1
#5 what Speaker1
#6 your Speaker1
#7 goals Speaker1
#8 are. Speaker1
#9 Yeah, Speaker2
#10 so Speaker2
#11 so Speaker2
#12 John Speaker2
#13 has Speaker2
#14 15 Speaker2
Obviously, we can strip the "Speaker" part from the SpeakerTag column if it is not needed.
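For a file edited in the question's ###Speaker N### layout, a reader along the following lines could rebuild the word-level table with integer tags directly. This is a sketch under two assumptions: the file starts with a header line, and every header matches ###Speaker <digits>### exactly. The temporary file stands in for the hand-edited one.

```r
library(data.table)

# Stand-in for the hand-edited file, using the layout from the question.
path <- tempfile(fileext = ".txt")
writeLines(c("###Speaker 1###",
             "been going on and what your goals in r",
             "###Speaker 2###",
             "Yeah, so so John hates 50"), path)

lines <- readLines(path)
lines <- lines[nzchar(trimws(lines))]   # drop blank lines

# Header lines carry the speaker number; text lines inherit the most
# recent header via the running count of headers seen so far.
is_header <- grepl("^###Speaker [0-9]+###$", lines)
tags <- as.integer(sub("^###Speaker ([0-9]+)###$", "\\1", lines[is_header]))
speaker <- tags[cumsum(is_header)]

# Split every text line into words and repeat its tag once per word.
words <- strsplit(trimws(lines[!is_header]), "[[:space:]]+")
res <- data.table(word = unlist(words),
                  speakerTag = rep(speaker[!is_header], lengths(words)))
```

A text block that was broken across several lines during editing still works here, since each line picks up the tag of the header above it. tibble::as_tibble(res) converts the result back to a tibble if needed.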