Segment a corpus in quanteda

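I have a text file containing many speeches. The file has two variables, one for the speech_id and one for the text of the speech, separated by a pipe (|). I am trying to use the corpus_segment function in quanteda to split the text into smaller documents.

The .txt file looks like this: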

Speech_id|speech1140000001|This is the first speech.1140000002|The second 
speech starts here.1140000003|This is the third speech.1140000004|The fourth 
speaker says this.

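I have tried various iterations but cannot get it to work. I also tried reading the file with the readtext function from the readtext package, without success. Any help is greatly appreciated.

corpus_segment() should work for this. (This is based on quanteda >= 1.0.0.) Here I assume that every speech ID is a 10-digit number followed by a | character. Note that readtext could have read this .txt file, but it would have come in as a single "document".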

library("quanteda")

txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second 
speech starts here.1140000003|This is the third speech.1140000004|The fourth 
speaker says this."

corp <- corpus(txt)

# backslashes must be doubled inside an R string literal
corpseg <- corpus_segment(corp, pattern = "\\d{10}\\|", valuetype = "regex")
texts(corpseg)  # in quanteda >= 3.0, texts() is deprecated; use as.character(corpseg)
##                     text1.1                            text1.2 
## "This is the first speech." "The second \nspeech starts here." 
##                     text1.3                            text1.4 
## "This is the third speech."  "The fourth \nspeaker says this." 
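
As a quick sanity check on the assumption that every ID is 10 digits followed by |, the same pattern can be tried in base R without quanteda. This is only a sketch: the names tags and pieces are made up for illustration, and the sample string is the one from the question, written here with explicit \n line breaks. Inside an R string literal the regex \d{10}\| is written with doubled backslashes.

```r
# Sample text from the question, with the two line breaks written as \n.
txt <- paste0(
  "Speech_id|speech1140000001|This is the first speech.1140000002|The second \n",
  "speech starts here.1140000003|This is the third speech.1140000004|The fourth \n",
  "speaker says this."
)

# Ten digits followed by a literal pipe.
pattern <- "\\d{10}\\|"

# Every tag the pattern finds in the string (illustrative name: tags).
tags <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Text between the tags; the first piece is the header, so drop it.
pieces <- trimws(strsplit(txt, pattern, perl = TRUE)[[1]][-1])

# Label each piece with its ID, minus the trailing pipe.
names(pieces) <- sub("|", "", tags, fixed = TRUE)

print(tags)
print(pieces)
```

If the number of tags is not what you expect, the IDs probably do not all follow the 10-digit form, and the pattern needs adjusting before it is passed to corpus_segment().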

That gets us there, but we can tidy this up further by moving the extracted pattern into the document names.

# move the tag to docname after removing "|"
docnames(corpseg) <- 
    stringi::stri_replace_all_fixed(docvars(corpseg, "pattern"), "|", "")
# remove the pattern as a docvar
docvars(corpseg, "pattern") <- NULL

summary(corpseg)
## Corpus consisting of 4 documents:
##     
##       Text Types Tokens Sentences
## 1140000001     6      6         1
## 1140000002     6      6         1
## 1140000003     6      6         1
## 1140000004     6      6         1
## 
## Source: /Users/kbenoit/Dropbox (Personal)/tmp/ascharacter/* on x86_64 by kbenoit
## Created: Tue Mar 27 07:41:05 2018
## Notes: corpus_segment.corpus(corp, pattern = "\\d{10}\\|", valuetype = "regex")
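
Since the question starts from a file on disk rather than a string, here is one possible way to get the whole file into a single string using base R only. It is a sketch under assumptions: the temporary file stands in for the real .txt file, and the variable names path and lines are illustrative. The resulting txt can then be passed to corpus() and segmented as shown in the answer.

```r
# Stand-in for the real file: write the sample to a temporary path.
# Replace `path` with the location of your own .txt file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Speech_id|speech1140000001|This is the first speech.1140000002|The second ",
  "speech starts here.1140000003|This is the third speech.1140000004|The fourth ",
  "speaker says this."
), path)

# Read every line, then join the lines back into one string,
# keeping the original line breaks as \n.
lines <- readLines(path, warn = FALSE)
txt <- paste(lines, collapse = "\n")

print(length(lines))  # number of physical lines in the file
print(length(txt))    # a single character string
```

Reading the file as one string matches the readtext behaviour described in the answer, where the whole file arrives as a single document and the segmentation is left to corpus_segment().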