Splitting a document from a tm Corpus into multiple documents
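Bit of a weird question: is there a way to split corpus documents that have been imported using the Corpus function in tm into multiple documents, which can then be re-read into my corpus as separate documents? For example, if I run
inspect(documents[1])
and get something like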
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff
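I would like to split the document after the phrase "I want to split after this line!!!", which appears twice in this case. Is this possible?
The end result after running inspect(documents) would look like this: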
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
[[2]]
<<PlainTextDocument (metadata: 7)>>
Hi mom
Purple is my favorite color
I want to split after this line!!!
[[3]]
<<PlainTextDocument (metadata: 7)>>
Words
And stuff
You can use strsplit to split your document and then recreate the corpus:
Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",
           fixed = TRUE)[[1]]))           ## use fixed = TRUE since the separator
                                          ## contains special characters
To test this, we should first create a reproducible example:
documents <- Corpus(VectorSource(paste(readLines(textConnection("The quick brown fox jumped over the lazy dog
I think cats are really cool
I want to split after this line!!!
Hi mom
Purple is my favorite color
I want to split after this line!!!
Words
And stuff")),collapse='\n')))
Then apply the previous solution:
split.docs <- Corpus(VectorSource(
  strsplit(as.character(documents[[1]]),  ## coerce to character
           "I want to split after this line!!!",
           fixed = TRUE)[[1]]))
Now check the result:
inspect(split.docs)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
The quick brown fox jumped over the lazy dog
I think cats are really cool
[[2]]
<<PlainTextDocument (metadata: 7)>>
Hi mom
Purple is my favorite color
[[3]]
<<PlainTextDocument (metadata: 7)>>
Words
And stuff
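Note that strsplit removes the separator, so the pieces no longer end with the "I want to split after this line!!!" line :)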
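If the separator line should stay at the end of each piece, as in the output the question asks for, one option is to split the text into lines and group them so that a new piece starts right after every marker line. The following is only a minimal base-R sketch: split_after_marker is a hypothetical helper written for illustration (it is not part of tm), and it assumes the marker sits on a line of its own.

```r
# Hypothetical helper (not from tm): split `text` after every line equal to
# `marker`, keeping the marker line as the last line of its piece.
split_after_marker <- function(text, marker) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  if (length(lines) == 0) return(character(0))
  is_marker <- trimws(lines) == marker
  # A new group begins on the line that follows each marker line.
  group <- cumsum(c(FALSE, head(is_marker, -1)))
  unname(vapply(split(lines, group), paste, character(1), collapse = "\n"))
}

marker <- "I want to split after this line!!!"
txt <- paste(c("The quick brown fox jumped over the lazy dog",
               "I think cats are really cool",
               marker,
               "Hi mom",
               "Purple is my favorite color",
               marker,
               "Words",
               "And stuff"),
             collapse = "\n")

pieces <- split_after_marker(txt, marker)
```

The resulting character vector can then be passed to Corpus(VectorSource(pieces)) in the same way as the strsplit result above.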
Here's a simpler way, using the quanteda package (mytext is assumed to hold the document text as a character string):
require(quanteda)
segment(mytext, what = "other", delimiter = "I want to split after this line!!!")
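This produces a list of length 1 (since it is designed to operate on multiple documents, should you so desire), but you can always unlist() it if you just want a vector.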
[[1]]
[1] "The quick brown fox jumped over the lazy dog\n\nI think cats are really cool\n\n"
[2] "\n \nHi mom\n\nPurple is my favorite color\n\n"
[3] "\n \nWords\n\nAnd stuff"
This can be read back into a quanteda corpus using corpus(mytextSegmented), or into a tm corpus, for subsequent processing.
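The "one list element per input document" shape described above is also what base R's strsplit returns when it is given several strings at once. As a rough illustration (plain base R, not quanteda; the document names and the " SPLIT " delimiter are made up for this sketch), the list can be flattened with unlist() when a single vector of segments is all that is needed:

```r
# Two toy documents, each containing a made-up delimiter.
docs <- c(doc1 = "a SPLIT b", doc2 = "c SPLIT d SPLIT e")

# strsplit returns a list with one character vector per input document.
segmented <- strsplit(docs, " SPLIT ", fixed = TRUE)

# Flatten to a single character vector, dropping the document names.
flat <- unlist(segmented, use.names = FALSE)
```

Keeping the list form preserves which segment came from which document; flattening discards that link, so it is worth deciding which of the two the later processing needs.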