Unlisting Corpus from TM package giving NA's

Question

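I have a corpus created with the tm package. I have applied all of my transformations to it and am ready to convert it back to a data frame.

When I use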

twit[[1]]$content

I can see my data. However, when I try to unlist it, every record comes back as NA.

twitCln <- data.frame(text = unlist(sapply(twit, '[', "content")), stringsAsFactors = FALSE)

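The same problem came up in the discussion following the only answer to the linked question, but it does not appear to have been resolved there.

Here is some reproducible code.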

library(tm)

# Build a two-row data frame with the doc_id and text columns that DataframeSource expects
bbTwit <- as.data.frame(c("Text Line One!", "Text Line 2"), stringsAsFactors = FALSE)
colnames(bbTwit) <- 'Contents'
bbTwit$doc_id <- row.names(bbTwit)
twit <- bbTwit[c('doc_id', 'Contents')]
colnames(twit) <- c('doc_id', 'text')

# Create the corpus and apply the transformations
twit <- Corpus(DataframeSource(data.frame(twit)))
twit <- tm_map(twit, removePunctuation)
twit <- tm_map(twit, stripWhitespace)

twit[[1]]$content

twitCln <- data.frame(text = unlist(sapply(twit, '[', "content")), stringsAsFactors = FALSE)

The expected output is a data frame with 2 observations, where "Text Line One" is the first record and "Text Line 2" is the second. What I get instead is two observations of NA.

Answer 1

Based on your description of the desired output, it sounds like you want

mydf <- data.frame(unlist(twit)[1:(length(unlist(twit))-1)])

content1                              Text Line One
content2                                Text Line 2

where the row and column names can of course be set to anything you like with names().

Or, for a simple case:

rbind(twit[[1]]$content,
           twit[[2]]$content)

[1,] "Text Line One"
[2,] "Text Line 2"

For example

mydf <- data.frame(rbind(twit[[1]]$content,
                 twit[[2]]$content)
)
colnames(mydf) <- "Pretty Column"
mydf

  Pretty Column
1 Text Line One
2   Text Line 2
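
The rbind() approach above names each document by hand, which does not scale past a few documents. A minimal sketch of the same idea for a corpus of any length follows. It relies only on the `twit[[i]]$content` access that the question shows working; the names `docs` and `texts` are illustrative and not part of the original code.

```r
library(tm)

# Rebuild the example corpus from the question (illustrative data frame name)
docs <- data.frame(doc_id = c("1", "2"),
                   text = c("Text Line One!", "Text Line 2"),
                   stringsAsFactors = FALSE)
twit <- Corpus(DataframeSource(docs))
twit <- tm_map(twit, removePunctuation)
twit <- tm_map(twit, stripWhitespace)

# Pull each document's text by position via [[i]]$content
texts <- vapply(seq_len(length(twit)),
                function(i) twit[[i]]$content,
                character(1))

mydf <- data.frame(text = texts, stringsAsFactors = FALSE)
```

Using vapply() with `character(1)` makes the call fail loudly if any document holds more than one string, rather than silently producing a ragged result.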

Answer 2

To get the content out, just use the content() function. For example

content(twit)
# [1] "Text Line One" "Text Line 2"

or put it in a data.frame

data.frame(text=content(twit))
#            text
# 1 Text Line One
# 2   Text Line 2


r

tm