Splitting a corpus into smaller documents
segment corpus in quanteda
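I have a text file containing many speeches. The file has two variables, one for the speech_id and one for the text of the speech, separated by a pipe (|). I am trying to use the corpus_segment function in quanteda to split the text into smaller documents.
The .txt file looks like this: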
Speech_id|speech1140000001|This is the first speech.1140000002|The second
speech starts here.1140000003|This is the third speech.1140000004|The fourth
speaker says this.
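I have tried various iterations but cannot get it to work. I also tried reading the file in with the readtext function from the readtext package, without success. Any help is greatly appreciated.
corpus_segment() should work for this. (This is based on quanteda >= 1.0.0.) Here I am assuming that every speech ID is 10 digits followed by a | character. Note that readtext could have read this .txt file, but the result would have been a single "document" on one line.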
library("quanteda")
txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second
speech starts here.1140000003|This is the third speech.1140000004|The fourth
speaker says this."
corp <- corpus(txt)
# backslashes must be doubled inside an R string literal
corpseg <- corpus_segment(corp, pattern = "\\d{10}\\|", valuetype = "regex")
texts(corpseg)
## text1.1 text1.2
## "This is the first speech." "The second \nspeech starts here."
## text1.3 text1.4
## "This is the third speech." "The fourth \nspeaker says this."
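To see what the regular expression is matching, here is a minimal base R sketch that pulls out the same tags and the text between them. It only illustrates the pattern and is not how quanteda implements corpus_segment() internally; the variable names (ids, pieces, speeches) are made up for this example.

```r
# Same sample text as above
txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second
speech starts here.1140000003|This is the third speech.1140000004|The fourth
speaker says this."

# Ten digits followed by a literal pipe
pattern <- "\\d{10}\\|"

# The tags matched by the pattern, with the trailing "|" removed
ids <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
ids <- sub("|", "", ids, fixed = TRUE)

# The text between the tags; the first piece is the header, so drop it
pieces <- strsplit(txt, pattern, perl = TRUE)[[1]]
speeches <- trimws(pieces[-1])

data.frame(speech_id = ids, speech = speeches, stringsAsFactors = FALSE)
```

If the IDs in the real file are not always 10 digits, the quantifier in the pattern would need to be adjusted accordingly.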
That gets us there, but we can tidy it up further by moving the extracted pattern into the document names.
# move the tag to docname after removing "|"
docnames(corpseg) <-
    stringi::stri_replace_all_fixed(docvars(corpseg, "pattern"), "|", "")
# remove the pattern as a docvar
docvars(corpseg, "pattern") <- NULL
summary(corpseg)
## Corpus consisting of 4 documents:
##
## Text Types Tokens Sentences
## 1140000001 6 6 1
## 1140000002 6 6 1
## 1140000003 6 6 1
## 1140000004 6 6 1
##
## Source: /Users/kbenoit/Dropbox (Personal)/tmp/ascharacter/* on x86_64 by kbenoit
## Created: Tue Mar 27 07:41:05 2018
## Notes: corpus_segment.corpus(corp, pattern = "\d{10}\|", valuetype = "regex")
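If the speeches are in a file rather than a string literal, one option is to read the file and collapse it into a single string before building the corpus. The sketch below uses only base R and writes the sample to a temporary file so that it is self-contained; the path variable is a placeholder for the real file location.

```r
# Write the sample to a temporary file (stand-in for the real .txt file)
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Speech_id|speech1140000001|This is the first speech.1140000002|The second",
  "speech starts here.1140000003|This is the third speech.1140000004|The fourth",
  "speaker says this."
), path)

# Read every line and join them back into one string
txt <- paste(readLines(path), collapse = "\n")

# A single character string, ready to pass to corpus()
length(txt)
```

The resulting txt can then be passed to corpus() and corpus_segment() exactly as in the answer.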