Splitting a document from a tm Corpus into multiple documents

Question

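A bit of a strange question: is there a way to split a document in a corpus that I imported with tm's Corpus function into multiple documents, which can then be read back into my corpus as separate documents? For example, if I use inspect(documents[1]) and get something like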

`<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>`

`[[1]]`

`<<PlainTextDocument (metadata: 7)>>`

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

Hi mom

Purple is my favorite color

I want to split after this line!!!

Words

And stuff

I want to split the document after the phrase "I want to split after this line!!!", which in this case appears twice. Is this possible?

The end result after using inspect(documents) would look something like this:

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]

<<PlainTextDocument (metadata: 7)>>

The quick brown fox jumped over the lazy dog

I think cats are really cool

I want to split after this line!!!

[[2]]

<<PlainTextDocument (metadata: 7)>>

Hi mom

Purple is my favorite color

I want to split after this line!!!

[[3]]

<<PlainTextDocument (metadata: 7)>>

Words

And stuff

Answer 1

You can use strsplit to split the document and then recreate the corpus:

Corpus(VectorSource(
          strsplit(as.character(documents[[1]]),  ## coerce to character
          "I want to split after this line!!!",
          fixed=TRUE)[[1]]))       ## fixed=TRUE matches the separator literally,
                                   ## not as a regular expression

To test this, we should first create a reproducible example:

documents <- Corpus(VectorSource(paste(readLines(textConnection("The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff")),collapse='\n')))

Then apply the previous solution:

split.docs <- Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",   
           fixed=TRUE)[[1]]))

Now check the result:

inspect(split.docs)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool


[[2]]
<<PlainTextDocument (metadata: 7)>>

Hi mom
Purple is my favorite color


[[3]]
<<PlainTextDocument (metadata: 7)>>

Words
And stuff

Note that strsplit removes the separator :)
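If the separator line should stay at the end of each piece, as in the output the question asks for, one option is to split on a zero-width lookbehind with perl = TRUE. The following is a minimal base R sketch and not part of the original answer. The helper name split_after is hypothetical, and it assumes the separator does not itself contain the sequence \E.

```r
# Hypothetical helper: split `text` after every occurrence of `sep`,
# keeping `sep` at the end of each piece.
split_after <- function(text, sep) {
  # \\Q...\\E quotes the separator so it is matched literally;
  # (?<=...) is a zero-width lookbehind, so nothing is consumed by the split.
  pattern <- paste0("(?<=\\Q", sep, "\\E)")
  pieces <- strsplit(text, pattern, perl = TRUE)[[1]]
  # Drop the leading/trailing newlines left around each piece.
  trimws(pieces)
}

sep <- "I want to split after this line!!!"
text <- paste(c("The quick brown fox jumped over the lazy dog",
                "I think cats are really cool",
                sep,
                "Hi mom",
                "Purple is my favorite color",
                sep,
                "Words",
                "And stuff"),
              collapse = "\n")

pieces <- split_after(text, sep)
print(pieces)
```

The resulting character vector can be passed to Corpus(VectorSource(pieces)) in the same way as the strsplit output above.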

Answer 2

Here is a simpler approach, using the quanteda package:

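library(quanteda)
## mytext: a character vector holding the document text.
## Note: later quanteda releases replaced segment() with char_segment() / corpus_segment().
mytextSegmented <- segment(mytext, what = "other", delimiter = "I want to split after this line!!!")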

This produces a list of length 1 (because it is designed to hold multiple documents, should you want that), but if you just want a vector you can always unlist() it.

[[1]]
[1] "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n"
[2] "\n    \nHi mom\n\nPurple is my favorite color\n\n"                               
[3] "\n    \nWords\n\nAnd stuff"

This can be read back into a quanteda corpus with corpus(mytextSegmented), or into a tm corpus, for further processing.
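Reading the segments back into tm could look roughly like the sketch below. It is not from the original answer: the segments vector is copied from the output shown above rather than produced by quanteda, the variable names are illustrative, and the tm package is assumed to be installed.

```r
library(tm)

# Segments as printed above (after unlist()); hard-coded here for illustration.
segments <- c(
  "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n",
  "\n    \nHi mom\n\nPurple is my favorite color\n\n",
  "\n    \nWords\n\nAnd stuff"
)

# Trim the whitespace left behind by the split and drop any empty pieces.
segments <- trimws(segments)
segments <- segments[nzchar(segments)]

# Each element of the vector becomes one document in the new corpus.
tm_corpus <- VCorpus(VectorSource(segments))
print(length(tm_corpus))
```

Trimming first avoids carrying the stray blank lines around the delimiter into the new documents.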

