Joining adjacent words (tokens) in a TDM for tidy analysis

Question

I have documents that contain strings similar to the following:

    textForAnalysis <- c("non-ifrs earnings numbers are report to be...")

which I load into a corpus:

    textCorpus <- Corpus(VectorSource(textForAnalysis))

and then convert to a TDM:

    textTDM <- TermDocumentMatrix(textCorpus)

I then convert the TDM to a tidy format for analysis:

    textTidy <- tidy(textTDM)

When I print the tidy text, everything looks fine:

    > textTidy
    # A tibble: 6 × 3
          term document count
         <chr>    <chr> <dbl>
    1      are        1     1
    2 earnings        1     1
    3     ifrs        1     1
    4      non        1     1
    5  numbers        1     1
    6   report        1     1

The one exception is that I want to keep "non-ifrs" as a single token (word). I don't want the phrase "non-ifrs" split into "non" and "ifrs".

How do I keep adjacent words such as "non-ifrs" together as a single "term" in my analysis/TDM?

Answer 1

A passage in the documentation for TermDocumentMatrix may be the key:

This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp) and takes no custom functions as option arguments.

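You have the assignment

    textCorpus <- Corpus(VectorSource(textForAnalysis))

As you can see from class(textCorpus), that variable is an instance of SimpleCorpus, so the control options you pass to TermDocumentMatrix cannot change how the text is tokenized.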
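A quick way to confirm the class difference is sketched below. It assumes the tm package (version 0.7 or later, where SimpleCorpus exists) is installed; the variable names simpleCorpus and vCorpus are only illustrative and do not appear in the question.

```r
library(tm)

textForAnalysis <- c("non-ifrs earnings numbers are report to be...")

# Corpus() on a VectorSource is expected to give a SimpleCorpus
simpleCorpus <- Corpus(VectorSource(textForAnalysis))
# VCorpus() always gives a VCorpus
vCorpus <- VCorpus(VectorSource(textForAnalysis))

# Print the class vectors to compare the two objects
print(class(simpleCorpus))
print(class(vCorpus))

# inherits() is a safer check than comparing class() strings directly
isSimple <- inherits(simpleCorpus, "SimpleCorpus")
isVCorpus <- inherits(vCorpus, "VCorpus")
```

If isSimple is TRUE for your own corpus object, the passage quoted above applies to it.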

Use VCorpus instead of Corpus:

    textCorpus <- VCorpus(VectorSource(textForAnalysis))

Now you can apply all the control parameters you need:

    textTDM <- TermDocumentMatrix(
      textCorpus,
      control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE))
    )

The result is:

    > (textTidy <- tidy(textTDM))
    # A tibble: 5 × 3
          term document count
         <chr>    <chr> <dbl>
    1      are        1     1
    2 earnings        1     1
    3 non-ifrs        1     1
    4  numbers        1     1
    5   report        1     1
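For convenience, the steps above can be collected into one script. This is a sketch under two assumptions: that tidy() comes from the tidytext package (the question does not say which package it is loaded from), and that the hasJoinedTerm and hasSplitTerm names are added here purely for checking the result.

```r
library(tm)        # VCorpus, VectorSource, TermDocumentMatrix
library(tidytext)  # assumed source of the tidy() method for a TDM

textForAnalysis <- c("non-ifrs earnings numbers are report to be...")

# VCorpus rather than Corpus, so the control options are honoured
textCorpus <- VCorpus(VectorSource(textForAnalysis))

# Strip punctuation but keep dashes that sit inside a word
textTDM <- TermDocumentMatrix(
  textCorpus,
  control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE))
)

textTidy <- tidy(textTDM)
print(textTidy)

# Illustrative checks: the hyphenated term is kept, its parts are not
hasJoinedTerm <- "non-ifrs" %in% textTidy$term
hasSplitTerm <- any(c("non", "ifrs") %in% textTidy$term)
```

Note that preserve_intra_word_dashes only protects dashes between word characters. Adjacent words separated by a space would still be split into separate terms.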
