Combine corpora in tm 0.7.3
Using R's text mining package tm, the following works in tm version 0.6.2 with R version 3.4.3:
library(tm)
a = "This is the first document."
b = "This is the second document."
c = "This is the third document."
d = "This is the fourth document."
docs1 = VectorSource(c(a,b))
docs2 = VectorSource(c(c,d))
corpus1 = Corpus(docs1)
corpus2 = Corpus(docs2)
corpus3 = c(corpus1,corpus2)
inspect(corpus3)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 4
But the same code throws an error in tm version 0.7.3 (R version 3.4.2):
Error in UseMethod("inspect", x) :
no applicable method for 'inspect' applied to an object of class "list"
According to vignette("tm", package = "tm"), the c() function is overloaded:
Many standard operators and functions ([, [<-, [[, [[<-, c(), lapply()) are available for corpora with semantics similar to standard R routines. E.g., c() concatenates two (or more) corpora. Applied to several text documents it returns a corpus. The metadata is automatically updated, if corpora are concatenated (i.e., merged).
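However, this apparently is no longer the case in the new version. How can two corpora be merged in tm 0.7.3? An obvious solution would be to combine the documents first and create the corpus afterwards, but I am looking for a way to combine two corpora that already exist.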
I don't have much experience with the tm package, so my answer may be missing some of the finer points of SimpleCorpus vs. VCorpus vs. the other tm object classes.
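The inputs to your call to c are of class SimpleCorpus, and it doesn't look like tm ships a c method specifically for this class. So method dispatch isn't calling a c that combines the corpora the way you want; the default c is used instead, which is why the result is a plain list. There is, however, a c method for the VCorpus class (tm:::c.VCorpus).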
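The dispatch behaviour described above can be checked directly. This is a minimal sketch, assuming tm is installed; what it prints depends on the installed tm version, so no output is shown here. The variable name merged is only illustrative.

```r
library(tm)

corpus1 <- Corpus(VectorSource(c("This is the first document.",
                                 "This is the second document.")))
corpus2 <- Corpus(VectorSource(c("This is the third document.",
                                 "This is the fourth document.")))

# Which class did Corpus() actually produce for a VectorSource?
print(class(corpus1))

# Is an S3 method of c() registered for that class?
# getS3method() returns NULL when none is found (optional = TRUE).
print(is.null(getS3method("c", class(corpus1)[1], optional = TRUE)))

# The VCorpus method that the answer relies on is registered:
print(is.function(getS3method("c", "VCorpus")))

# What does the plain c() call return for these inputs?
merged <- c(corpus1, corpus2)
print(class(merged))
```

If the second print shows TRUE and the last one shows "list", you are in the situation from the question.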
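There are two different ways to get around corpus3 being coerced to a list, but they seem to produce different structures. I show both below; it is up to you to decide whether either one achieves your end goal.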
1) You can call tm:::c.VCorpus directly when defining corpus3:
> library(tm)
>
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = Corpus(docs1)
> corpus2 = Corpus(docs2)
>
> corpus3 = tm:::c.VCorpus(corpus1,corpus2)
>
> inspect(corpus3)
<<VCorpus>>
Metadata: corpus specific: 2, document level (indexed): 0
Content: documents: 4
[1] This is the first document. This is the second document. This is the third document.
[4] This is the fourth document.
2) You can use VCorpus when defining corpus1 & corpus2:
> library(tm)
>
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = VCorpus(docs1)
> corpus2 = VCorpus(docs2)
>
> corpus3 = c(corpus1,corpus2)
>
> inspect(corpus3)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 4
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 27
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 28
[[3]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 27
[[4]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 28
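A further possibility, which is not one of the two approaches above and is only a sketch, avoids the unexported tm:::c.VCorpus by pulling the text out of the existing corpora with content() and rebuilding a VCorpus from it. It assumes that corpus1 and corpus2 are SimpleCorpus objects (so that content() returns character vectors) and that losing any corpus-level or document-level metadata is acceptable. The name texts is only illustrative.

```r
library(tm)

corpus1 <- Corpus(VectorSource(c("This is the first document.",
                                 "This is the second document.")))
corpus2 <- Corpus(VectorSource(c("This is the third document.",
                                 "This is the fourth document.")))

# Extract the raw text of both existing corpora.
# unname() drops the per-corpus element names so they cannot collide.
texts <- unname(c(content(corpus1), content(corpus2)))

# Rebuild one VCorpus from the combined text (metadata is NOT carried over).
corpus3 <- VCorpus(VectorSource(texts))

inspect(corpus3)
```

Because the result is a VCorpus, later calls to c() on it dispatch to the VCorpus method, as in approach 2.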