Splitting a document from a tm Corpus into multiple documents

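A slightly odd question: is there a way to split a document in a corpus that was imported with tm's Corpus function into multiple documents, which can then be read back into my corpus as separate documents? For example, suppose inspect(documents[1]) shows something like this: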
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]

<<PlainTextDocument (metadata: 7)>>

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

Hi mom

Purple is my favorite color

I want to split after this line!!!

Words

And stuff

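I want to split the document after each occurrence of the phrase "I want to split after this line!!!", which appears twice in this case. Is that possible?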

The final result, after running inspect(documents), would look like this:

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]

<<PlainTextDocument (metadata: 7)>>

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

[[2]]

<<PlainTextDocument (metadata: 7)>>

Hi mom

Purple is my favorite color

I want to split after this line!!!

[[3]]

<<PlainTextDocument (metadata: 7)>>

Words

And stuff

You can split the document with strsplit and then recreate the corpus:

Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),   ## coerce to character
           "I want to split after this line!!!",
           fixed = TRUE)[[1]]))            ## fixed = TRUE matches the separator literally
                                           ## rather than as a regular expression
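
The snippet above splits only documents[[1]]. If a corpus holds several documents that each need splitting, the same idea can be applied to a whole character vector at once. The following is a minimal sketch, not part of the original answer: split_on_separator is a hypothetical helper name, the two sample texts are made up for illustration, and the step that rebuilds a tm corpus runs only if tm is installed.

```r
## Hypothetical helper: split every element of a character vector on a
## literal separator and return all pieces as one flat character vector.
split_on_separator <- function(texts, sep) {
  pieces <- unlist(strsplit(texts, sep, fixed = TRUE), use.names = FALSE)
  pieces <- trimws(pieces)      # drop the newlines left around each piece
  pieces[nzchar(pieces)]        # discard pieces that are now empty
}

sep <- "I want to split after this line!!!"

## Made-up sample texts; with a real corpus they might be extracted with
## something like:
##   texts <- sapply(documents, function(d) paste(as.character(d), collapse = "\n"))
texts <- c(
  "first part\nI want to split after this line!!!\nsecond part",
  "no separator in this document"
)

pieces <- split_on_separator(texts, sep)

## Rebuild a corpus with one document per piece (only when tm is available).
if (requireNamespace("tm", quietly = TRUE)) {
  split.docs <- tm::VCorpus(tm::VectorSource(pieces))
}
```

Documents that do not contain the separator pass through unchanged as a single piece.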

To test the strsplit approach, first create a reproducible example:

documents <- Corpus(VectorSource(paste(readLines(textConnection("The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff")),collapse='\n')))

Then apply the solution from above:

split.docs <- Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",   
           fixed=TRUE)[[1]]))  

Now inspect the result:

inspect(split.docs)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool


[[2]]
<<PlainTextDocument (metadata: 7)>>

Hi mom
Purple is my favorite color


[[3]]
<<PlainTextDocument (metadata: 7)>>

Words
And stuff

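Note that strsplit removes the separator itself, so the "I want to split after this line!!!" line no longer appears in the resulting documents :)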
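
The question's desired output keeps the separator line at the end of each document. One possible way to get that with base R is to split on a zero-width lookbehind, so that the split happens just after the separator without consuming it. This is a sketch under assumptions rather than part of the original answer: the variable names are illustrative, and it relies on perl = TRUE with the separator quoted literally via \Q...\E.

```r
## Same sample text as in the reproducible example, built as a single string.
text <- paste(c("The quick brown fox jumped over the lazy dog",
                "I think cats are really cool",
                "I want to split after this line!!!",
                "Hi mom",
                "Purple is my favorite color",
                "I want to split after this line!!!",
                "Words",
                "And stuff"),
              collapse = "\n")

sep <- "I want to split after this line!!!"

## Zero-width lookbehind: match the empty position right after the separator.
## \Q...\E quotes the separator so it is treated as literal text.
pattern <- paste0("(?<=\\Q", sep, "\\E)")

pieces <- strsplit(text, pattern, perl = TRUE)[[1]]
pieces <- trimws(pieces)   # remove the newline left at the start of later pieces
```

The resulting pieces vector could then be passed to Corpus(VectorSource(pieces)) in the same way as in the answer.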

Here is a simpler way, using the quanteda package:

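library(quanteda)
## mytext is a character vector holding the text; segment() is from older
## quanteda releases (newer ones provide char_segment() and corpus_segment())
segment(mytext, what = "other", delimiter = "I want to split after this line!!!")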

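This produces a list of length 1 (the function is designed to handle multiple documents, should you want that), but you can always unlist() it if you just want a vector.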

[[1]]
[1] "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n"
[2] "\n    \nHi mom\n\nPurple is my favorite color\n\n"                               
[3] "\n    \nWords\n\nAnd stuff" 

The result can be read back into a quanteda corpus with corpus(mytextSegmented), or into a tm corpus, for subsequent processing.
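
The segments shown above still carry leading and trailing newlines, indentation, and blank lines. Before building a corpus from them, it may help to flatten the list and tidy the whitespace. The sketch below uses base R only and copies the segment strings from the output above; the names segmented and cleaned are illustrative.

```r
## The list of length 1 shown in the output above.
segmented <- list(c(
  "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n",
  "\n    \nHi mom\n\nPurple is my favorite color\n\n",
  "\n    \nWords\n\nAnd stuff"
))

segments <- unlist(segmented)          # flatten to a plain character vector
cleaned  <- trimws(segments)           # strip leading/trailing whitespace
cleaned  <- gsub("\n+", "\n", cleaned) # collapse runs of newlines into one
```

Each element of cleaned then holds one document with one sentence per line.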