Combine corpora in tm 0.7.3

Using R's text-mining package tm, the following works in tm version 0.6.2 with R version 3.4.3:

library(tm)
a = "This is the first document."
b = "This is the second document."
c = "This is the third document."
d = "This is the fourth document."
docs1 = VectorSource(c(a,b))
docs2 = VectorSource(c(c,d))
corpus1 = Corpus(docs1)
corpus2 = Corpus(docs2)
corpus3 = c(corpus1,corpus2)
inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4

However, the same code throws an error in tm version 0.7.3 (R version 3.4.2):

Error in UseMethod("inspect", x) :
  no applicable method for 'inspect' applied to an object of class "list"

According to vignette("tm",package="tm"), the c() function is overloaded:

Many standard operators and functions ([, [<-, [[, [[<-, c(), lapply()) are available for corpora with semantics similar to standard R routines. E.g., c() concatenates two (or more) corpora. Applied to several text documents it returns a corpus. The metadata is automatically updated, if corpora are concatenated (i.e., merged).

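This is evidently no longer the case in the new version. How can two corpora be combined in tm 0.7.3? An obvious workaround is to combine the documents first and create the corpus afterwards, but I am looking for a way to combine two corpora that already exist.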

I don't have much experience with the tm package, so my answer may be lacking in its understanding of SimpleCorpus vs. VCorpus vs. the other tm object classes.

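The inputs to your call to c are of class SimpleCorpus, and tm does not appear to ship a c method specifically for that class. Method dispatch therefore does not reach a c that combines the corpora the way you want; the default c is used instead and returns a plain list, which is why inspect fails. There is, however, a c method for the VCorpus class (tm:::c.VCorpus).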
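To illustrate the dispatch behaviour described above without depending on tm, here is a minimal base-R sketch. The classes `has_c_method`, `no_c_method` and `toy_corpus`, and the helper `new_toy`, are hypothetical names made up for this example; they are not part of tm and only mimic the situation where one class has a c method and another does not.

```r
# Toy S3 classes (hypothetical, not part of tm) that mimic the situation:
# "has_c_method" gets its own c() method, "no_c_method" does not.
new_toy <- function(docs, class_name) {
  structure(list(content = docs), class = c(class_name, "toy_corpus"))
}

# c() method for the first class only: merge the contents, keep the class.
c.has_c_method <- function(...) {
  parts <- list(...)
  merged <- unlist(lapply(parts, function(p) p$content))
  new_toy(merged, "has_c_method")
}

x1 <- new_toy(c("first", "second"), "has_c_method")
x2 <- new_toy(c("third", "fourth"), "has_c_method")
y1 <- new_toy(c("first", "second"), "no_c_method")
y2 <- new_toy(c("third", "fourth"), "no_c_method")

merged_ok  <- c(x1, x2)  # dispatches to c.has_c_method
merged_bad <- c(y1, y2)  # no method found, falls back to the default c()

class(merged_ok)   # "has_c_method" "toy_corpus"
class(merged_bad)  # "list"
```

The second result is the toy analogue of the error in the question: once the object is a bare list, class-specific functions such as inspect no longer have a method to dispatch to.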

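There are two different ways to get around corpus3 being coerced to a list, but they appear to produce different structures. I show both below and leave it to you to decide whether either one achieves your end goal.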

1) You can call tm:::c.VCorpus directly when defining corpus3:

> library(tm)
> 
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = Corpus(docs1)
> corpus2 = Corpus(docs2)
> 
> corpus3 = tm:::c.VCorpus(corpus1,corpus2)
> 
> inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 2, document level (indexed): 0
Content:  documents: 4

[1] This is the first document.  This is the second document. This is the third document. 
[4] This is the fourth document.

2) You can use VCorpus when defining corpus1 and corpus2:

> library(tm)
> 
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = VCorpus(docs1)
> corpus2 = VCorpus(docs2)
> 
> corpus3 = c(corpus1,corpus2)
> 
> inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 27

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 28

[[3]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 27

[[4]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 28