Joining adjacent words (tokens) in a TDM for tidy analysis

I have documents containing strings similar to the following:

    textForAnalysis <- c("non-ifrs earnings numbers are report to be...")

which I load into a corpus:

    textCorpus <- Corpus(VectorSource(textForAnalysis))

then convert to a TDM:

    textTDM <- TermDocumentMatrix(textCorpus)

and then convert the TDM to a tidy format for analysis:

    textTidy <- tidy(textTDM)

When I print textTidy, everything works:

    > textTidy
    # A tibble: 6 × 3
          term document count
         <chr>    <chr> <dbl>
    1      are        1     1
    2 earnings        1     1
    3     ifrs        1     1
    4      non        1     1
    5  numbers        1     1
    6   report        1     1

However, I want to keep "non-ifrs" as a single token (word). I do not want the phrase "non-ifrs" split into "non" and "ifrs".

How can I keep adjacent wording such as "non-ifrs" as a single "term" in my analysis/TDM?

A passage in the documentation of TermDocumentMatrix may be the key:

This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp) and takes no custom functions as option arguments.

You have the assignment

    textCorpus <- Corpus(VectorSource(textForAnalysis))

As you can see from class(textCorpus), this variable is an instance of SimpleCorpus.
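As a quick check of that point, here is a minimal sketch that compares the two constructors. It assumes the tm package is installed and recent enough that Corpus() returns a SimpleCorpus for a VectorSource, as described above. The variable names simpleCorpus and volatileCorpus are made up for illustration.

```r
library(tm)  # assumed installed; provides Corpus, VCorpus, VectorSource

textForAnalysis <- c("non-ifrs earnings numbers are report to be...")

# Corpus() picks the corpus class for you; VCorpus() is explicit.
simpleCorpus   <- Corpus(VectorSource(textForAnalysis))   # hypothetical name
volatileCorpus <- VCorpus(VectorSource(textForAnalysis))  # hypothetical name

# Inspect the class vectors, then test membership directly.
class(simpleCorpus)
class(volatileCorpus)
isSimple   <- inherits(simpleCorpus, "SimpleCorpus")
isVolatile <- inherits(volatileCorpus, "VCorpus")
```

If isSimple is TRUE, the quoted passage applies and custom control options are not honored for that corpus.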

Use VCorpus instead of Corpus:

    textCorpus <- VCorpus(VectorSource(textForAnalysis))

Now you can apply all the necessary control parameters:

    textTDM <- TermDocumentMatrix(
      textCorpus,
      control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE))
    )

The result is:

    (textTidy <- tidy(textTDM))
    # A tibble: 5 × 3
          term document count
         <chr>    <chr> <dbl>
    1      are        1     1
    2 earnings        1     1
    3 non-ifrs        1     1
    4  numbers        1     1
    5   report        1     1
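For convenience, the steps can be collected into one script. This is only a sketch: the library() calls are an assumption, since the post never shows which packages are loaded (tm for the corpus and TDM functions, tidytext for the tidy() method). The final check and the name hasHyphenTerm are added for illustration.

```r
library(tm)        # assumed: VCorpus, VectorSource, TermDocumentMatrix
library(tidytext)  # assumed: tidy() method for TermDocumentMatrix objects

textForAnalysis <- c("non-ifrs earnings numbers are report to be...")

# A VCorpus honors the control options; a SimpleCorpus does not.
textCorpus <- VCorpus(VectorSource(textForAnalysis))

# Strip punctuation but keep dashes that sit inside a word.
textTDM <- TermDocumentMatrix(
  textCorpus,
  control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE))
)

textTidy <- tidy(textTDM)

# Illustrative check: did the hyphenated term survive as one token?
hasHyphenTerm <- "non-ifrs" %in% textTidy$term
```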