Combine corpora in tm 0.7.3

Question

Using R's text mining package tm, the following works in tm version 0.6.2 with R version 3.4.3:

library(tm)
a = "This is the first document."
b = "This is the second document."
c = "This is the third document."
d = "This is the fourth document."
docs1 = VectorSource(c(a,b))
docs2 = VectorSource(c(c,d))
corpus1 = Corpus(docs1)
corpus2 = Corpus(docs2)
corpus3 = c(corpus1,corpus2)
inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4

But the same code throws an error in tm version 0.7.3 (R version 3.4.2):

Error in UseMethod("inspect", x) :
  no applicable method for 'inspect' applied to an object of class "list"

According to vignette("tm", package="tm"), the c() function is overloaded:

Many standard operators and functions ([, [<-, [[, [[<-, c(), lapply()) are available for corpora with semantics similar to standard R routines. E.g., c() concatenates two (or more) corpora. Applied to several text documents it returns a corpus. The metadata is automatically updated, if corpora are concatenated (i.e., merged).

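However, this is apparently no longer the case in the newer version. How can two corpora be merged in tm 0.7.3? An obvious workaround is to combine the documents first and create the corpus afterwards, but I am looking for a way to merge two corpora that already exist.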

Answer 1

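I don't have much experience with the tm package, so my answer may be lacking in understanding of SimpleCorpus vs. VCorpus vs. other tm object classes.

Your inputs to the call to c are of class SimpleCorpus; it doesn't look like tm ships with a c method specifically for this class. So method dispatch isn't calling the right c to combine the corpora the way you want, and the result falls back to a plain list. However, there is a c method for the VCorpus class (tm:::c.VCorpus).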
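To see the dispatch problem for yourself, a minimal sketch along these lines might help. It only assumes that tm is installed; the variable names `has_simple_c` and `has_vcorpus_c` are made up for illustration. It prints the class of the object that `Corpus()` returns and checks which `c` methods are registered, without assuming what the answer will be in your tm version.

```r
library(tm)

corpus1 <- Corpus(VectorSource(c("This is the first document.",
                                 "This is the second document.")))

# Class vector of the object that Corpus() returned in this tm version.
print(class(corpus1))

# getS3method() with optional = TRUE returns NULL instead of failing
# when no method is registered for that class.
has_simple_c  <- !is.null(getS3method("c", "SimpleCorpus", optional = TRUE))
has_vcorpus_c <- !is.null(getS3method("c", "VCorpus", optional = TRUE))

print(c(SimpleCorpus = has_simple_c, VCorpus = has_vcorpus_c))
```

If the first flag is FALSE while the second is TRUE, `c()` on two SimpleCorpus objects has no corpus-aware method to dispatch to, which would match the error in the question.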

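There are two different ways to get around corpus3 being coerced to a list, but they seem to result in different structures. I show both below and leave it up to you to decide whether they achieve your end goal.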

1) You can call `tm:::c.VCorpus` directly when defining `corpus3`:

> library(tm)
> 
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = Corpus(docs1)
> corpus2 = Corpus(docs2)
> 
> corpus3 = tm:::c.VCorpus(corpus1,corpus2)
> 
> inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 2, document level (indexed): 0
Content:  documents: 4

[1] This is the first document.  This is the second document. This is the third document. 
[4] This is the fourth document.

2) You can use `VCorpus` when defining `corpus1` & `corpus2`:

> library(tm)
> 
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = VCorpus(docs1)
> corpus2 = VCorpus(docs2)
> 
> corpus3 = c(corpus1,corpus2)
> 
> inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 27

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 28

[[3]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 27

[[4]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 28
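Since the two approaches print differently, it may be worth checking the merged corpus directly rather than relying on the `inspect()` output. The following is a rough sketch for approach 2 only; `n_docs` and `texts` are illustrative names, and pulling the text out with `sapply(..., as.character)` is one common idiom, not the only option.

```r
library(tm)

corpus1 <- VCorpus(VectorSource(c("This is the first document.",
                                  "This is the second document.")))
corpus2 <- VCorpus(VectorSource(c("This is the third document.",
                                  "This is the fourth document.")))

# Both inputs are VCorpus objects, so c() can dispatch to the VCorpus method.
corpus3 <- c(corpus1, corpus2)

# Number of documents in the merged corpus.
n_docs <- length(corpus3)

# Pull the raw text of every document back out as a plain character vector.
texts <- unname(sapply(corpus3, as.character))

print(n_docs)
print(texts)
```

If the document count and the texts are what you expect, the merge kept all documents from both corpora in order.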
